
A web crawler that converts web pages to markdown and prepares them for LLM consumption


TezzCrawler

TezzCrawler is a web crawler that converts web pages to markdown so that their content is ready for LLM consumption.

Features

  • Single page scraping with markdown conversion
  • Full website crawling using sitemap.xml
  • Proxy support for web scraping
  • Simple command-line interface
  • Easy to use as a Python package
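
To make the markdown conversion feature concrete, here is a minimal sketch of turning HTML into markdown using only the Python standard library. It is not TezzCrawler's implementation; the `MarkdownSketch` class and `html_to_markdown` function are hypothetical names introduced for illustration, and only headings, paragraphs, and links are handled.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-markdown converter (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.parts = []   # markdown fragments in document order
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            # Block elements end with a blank line
            self.parts.append("\n\n")
        elif tag == "a" and self.href is not None:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<h1>Title</h1><p>See <a href="https://example.com">the site</a>.</p>'
markdown = html_to_markdown(sample)
print(markdown)
```

A real converter also has to deal with lists, tables, code blocks, and malformed HTML, which is why a dedicated package is usually preferable to a hand-rolled parser.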

Installation

pip install TezzCrawler

Usage

Command Line Interface

  1. Scrape a single page:
tezzcrawler scrape-page https://example.com --output ./output
  2. Crawl from a sitemap:
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
  3. Scrape through a proxy:
tezzcrawler scrape-page https://example.com \
    --proxy-url proxy.example.com \
    --proxy-port 8080 \
    --proxy-username user \
    --proxy-password pass \
    --output ./output
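
If the CLI needs to be driven from a script, the command can be assembled as an argument list and passed to `subprocess`. The sketch below uses only the subcommand and flags shown above; the helper names `build_scrape_command` and `run_scrape` are hypothetical and not part of TezzCrawler.

```python
import subprocess
from typing import List, Optional


def build_scrape_command(
    url: str,
    output: str,
    proxy_url: Optional[str] = None,
    proxy_port: Optional[int] = None,
    proxy_username: Optional[str] = None,
    proxy_password: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `tezzcrawler scrape-page` (hypothetical helper)."""
    cmd = ["tezzcrawler", "scrape-page", url]
    optional_flags = [
        ("--proxy-url", proxy_url),
        ("--proxy-port", proxy_port),
        ("--proxy-username", proxy_username),
        ("--proxy-password", proxy_password),
    ]
    for flag, value in optional_flags:
        # Only pass the proxy flags that were actually supplied
        if value is not None:
            cmd += [flag, str(value)]
    cmd += ["--output", output]
    return cmd


def run_scrape(url: str, output: str, **proxy_options) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_scrape_command(url, output, **proxy_options), check=True)
```

Passing a list rather than a shell string avoids quoting problems when a URL or password contains special characters.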

Python Package

from tezzcrawler import Scraper, Crawler
from pathlib import Path

# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))

# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))

# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass",
)
scraper.scrape_page("https://example.com", Path("./output"))
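
For readers curious about what crawling from a sitemap involves, the following sketch extracts page URLs from a sitemap.xml document with the standard library. It illustrates the general technique under the sitemaps.org schema and is not taken from TezzCrawler's source; `extract_sitemap_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> List[str]:
    """Return the <loc> value of every <url> entry in a sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = extract_sitemap_urls(sample)
print(urls)
```

Each extracted URL could then be scraped individually, for example with `Scraper.scrape_page` as shown above. Sitemap index files, which point to other sitemaps rather than pages, are not handled in this sketch.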

Development

  1. Clone the repository:
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
  2. Install development dependencies:
pip install -e ".[dev]"

License

MIT License. See the LICENSE file for details.
