Crawls and indexes websites for local LLM work

SmolCrawl

A lightweight web crawler and indexer for creating searchable document collections from websites.

Overview

SmolCrawl is a Python-based tool that helps you:

  • Crawl websites and extract content
  • Convert HTML content to readable markdown
  • Index pages for efficient searching
  • Query indexed content with relevance scoring

Perfect for creating local knowledge bases, documentation search, or personal research collections.

Features

  • Simple Web Crawling: Easily crawl and extract content from target websites
  • Content Extraction: Automatically extracts meaningful content from HTML using readability algorithms
  • Markdown Conversion: Converts HTML content to clean, readable markdown format
  • Fast Indexing: Uses Tantivy (Rust-based search library) for performant full-text search
  • Caching: Implements disk-based caching to avoid redundant crawling
  • CLI Interface: Simple command-line interface for all operations

Installation

# Install from PyPI
pip install smolcrawl

# Or install the latest source in editable mode
git clone https://github.com/bllchmbrs/smolcrawl.git
cd smolcrawl
pip install -e .

Requirements

  • Python 3.11 or higher
  • Dependencies are automatically installed with the package

Usage

Crawl a Website

smolcrawl crawl https://example.com

Index a Website

smolcrawl index https://example.com my_index_name

List Available Indices

smolcrawl list_indices

Query an Index

smolcrawl query my_index_name "your search query" --limit 10 --score_threshold 0.5
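The `--limit` and `--score_threshold` flags cap the number of results and drop weakly matching hits. A minimal sketch of how such filtering might behave (the `(score, title)` pairs below are made up for illustration; this is not SmolCrawl's internal code):

```python
# Sketch of limit/threshold filtering over ranked search hits.
# Hit tuples are hypothetical (score, title) pairs.

def filter_hits(hits, limit=10, score_threshold=0.5):
    """Keep hits at or above the threshold, best first, capped at limit."""
    kept = [h for h in hits if h[0] >= score_threshold]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:limit]

hits = [(0.92, "Install guide"), (0.41, "Changelog"), (0.73, "API docs")]
print(filter_hits(hits, limit=2, score_threshold=0.5))
# → [(0.92, 'Install guide'), (0.73, 'API docs')]
```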

Delete an Index

smolcrawl delete_index my_index_name

Configuration

SmolCrawl uses environment variables for configuration:

  • STORAGE_PATH: Path to store data (default: ./data)
  • CACHE_PATH: Path for caching (default: ./data/cache)

You can set these in a .env file in the project root.
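For example, a `.env` file that spells out the documented defaults looks like this:

```shell
# .env — optional overrides for SmolCrawl's storage locations
STORAGE_PATH=./data
CACHE_PATH=./data/cache
```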

Project Structure

smolcrawl/
├── src/smolcrawl/
│   ├── __init__.py    # CLI and entry points
│   ├── crawl.py       # Web crawling functionality
│   ├── db.py          # Indexing and search functionality
│   └── utils.py       # Utility functions
├── data/              # Storage for indices and cache (gitignored)
├── .gitignore
└── pyproject.toml     # Project metadata and dependencies

How It Works

  1. Crawling: Uses BeautifulSoupCrawler to fetch web pages and extract links
  2. Content Processing: Extracts meaningful content using ReadabiliPy and converts to markdown
  3. Indexing: Stores extracted content in a Tantivy index for efficient searching
  4. Searching: Performs full-text search on indexed content with relevance ranking

Responsible Crawling

When crawling websites, be mindful and respectful of the site owners and their resources.

  • Check robots.txt: Always check a website's robots.txt file (https://example.com/robots.txt) before crawling. Respect the rules outlined there regarding which paths are allowed or disallowed for crawling.
  • Rate Limiting: Avoid overwhelming the target server with too many requests in a short period. Implement delays between requests if necessary (SmolCrawl does not currently have built-in rate limiting).
  • Identify Yourself: Consider setting a descriptive User-Agent string to identify your crawler, although SmolCrawl does not currently support custom User-Agents.
  • Crawl During Off-Peak Hours: If possible, schedule crawls during times when the website is likely to have lower traffic.
  • Use Caching: Take advantage of SmolCrawl's caching feature to avoid re-downloading content unnecessarily.
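Since SmolCrawl does not currently rate-limit, a small helper like this (hypothetical, not part of SmolCrawl) can space out requests when you drive crawling from your own code:

```python
# Minimal politeness delay between requests; a hypothetical helper,
# since SmolCrawl itself has no built-in rate limiting.
import time

class RateLimiter:
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval   # seconds between requests
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```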

Misusing web crawlers can lead to your IP address being blocked and can negatively impact the performance and availability of the website for others. Use SmolCrawl ethically and responsibly.
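The robots.txt check described above can be done entirely with Python's standard library `urllib.robotparser`; a sketch (rules parsed inline here to keep the example offline):

```python
# Checking robots.txt rules before crawling, standard library only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In practice: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse sample rules inline so the example runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "https://example.com/docs/"))      # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```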

License

[Your License Choice]

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Download files

Download the file for your platform.

Source Distribution

smolcrawl-0.1.7.tar.gz (10.4 kB)

Uploaded Source

Built Distribution


smolcrawl-0.1.7-py3-none-any.whl (9.7 kB)

Uploaded Python 3

File details

Details for the file smolcrawl-0.1.7.tar.gz.

File metadata

  • Download URL: smolcrawl-0.1.7.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for smolcrawl-0.1.7.tar.gz:

  • SHA256: a61405a04333bb213005c1a8cbbf92c929a36e7db105761f49bd6ecffe8e3592
  • MD5: 24af8c37d7205aef01ad78c2893635d8
  • BLAKE2b-256: 75e535dcd00c95fb338909b7c5a261c8537d35854ed623e9a4a4806a2f4297b0

Provenance

The following attestation bundles were made for smolcrawl-0.1.7.tar.gz:

Publisher: pypi.yml on bllchmbrs/smolcrawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file smolcrawl-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: smolcrawl-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for smolcrawl-0.1.7-py3-none-any.whl:

  • SHA256: 8549709efd0ddd631f207204a0337caba3f2c01cb79a668ffcadd7b117c76ef1
  • MD5: 5adcf61c9ada94ca9b0d951e9cd40268
  • BLAKE2b-256: 58c92b93c956e9242f911fe10e07fea4f605ed12841792b58a454ed15a460f75

Provenance

The following attestation bundles were made for smolcrawl-0.1.7-py3-none-any.whl:

Publisher: pypi.yml on bllchmbrs/smolcrawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
