Crawls and indexes websites for local LLM work

SmolCrawl

A lightweight web crawler and indexer for creating searchable document collections from websites.

Overview

SmolCrawl is a Python-based tool that helps you:

  • Crawl websites and extract content
  • Convert HTML content to readable markdown
  • Index pages for efficient searching
  • Query indexed content with relevance scoring

Perfect for creating local knowledge bases, documentation search, or personal research collections.

Features

  • Simple Web Crawling: Easily crawl and extract content from target websites
  • Content Extraction: Automatically extracts meaningful content from HTML using readability algorithms
  • Markdown Conversion: Converts HTML content to clean, readable markdown format
  • Fast Indexing: Uses Tantivy (Rust-based search library) for performant full-text search
  • Caching: Implements disk-based caching to avoid redundant crawling
  • CLI Interface: Simple command-line interface for all operations

Installation

# Install from PyPI
pip install smolcrawl

# Or install the latest source in editable mode
git clone https://github.com/bllchmbrs/smolcrawl.git
cd smolcrawl
pip install -e .

Requirements

  • Python 3.11 or higher
  • Dependencies are automatically installed with the package

Usage

Crawl a Website

smolcrawl crawl https://example.com

Index a Website

smolcrawl index https://example.com my_index_name

List Available Indices

smolcrawl list_indices

Query an Index

smolcrawl query my_index_name "your search query" --limit 10 --score_threshold 0.5
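The `--limit` and `--score_threshold` flags cap the number of results and drop weakly matching hits. A minimal sketch of how such filtering might behave (the `(score, title)` pairs below are made up for illustration; this is not SmolCrawl's internal code):

```python
# Sketch of limit/threshold filtering over ranked search hits.
# Hit tuples are hypothetical (score, title) pairs.

def filter_hits(hits, limit=10, score_threshold=0.5):
    """Keep hits at or above the threshold, best first, capped at limit."""
    kept = [h for h in hits if h[0] >= score_threshold]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:limit]

hits = [(0.92, "Install guide"), (0.41, "Changelog"), (0.73, "API docs")]
print(filter_hits(hits, limit=2, score_threshold=0.5))
# → [(0.92, 'Install guide'), (0.73, 'API docs')]
```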

Delete an Index

smolcrawl delete_index my_index_name

Configuration

SmolCrawl uses environment variables for configuration:

  • STORAGE_PATH: Path to store data (default: ./data)
  • CACHE_PATH: Path for caching (default: ./data/cache)

You can set these in a .env file in the project root.
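For example, a `.env` file that spells out the documented defaults looks like this:

```shell
# .env — optional overrides for SmolCrawl's storage locations
STORAGE_PATH=./data
CACHE_PATH=./data/cache
```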

Project Structure

smolcrawl/
├── src/smolcrawl/
│   ├── __init__.py    # CLI and entry points
│   ├── crawl.py       # Web crawling functionality
│   ├── db.py          # Indexing and search functionality
│   └── utils.py       # Utility functions
├── data/              # Storage for indices and cache (gitignored)
├── .gitignore
└── pyproject.toml     # Project metadata and dependencies

How It Works

  1. Crawling: Uses BeautifulSoupCrawler to fetch web pages and extract links
  2. Content Processing: Extracts meaningful content using ReadabiliPy and converts to markdown
  3. Indexing: Stores extracted content in a Tantivy index for efficient searching
  4. Searching: Performs full-text search on indexed content with relevance ranking

Responsible Crawling

When crawling websites, be mindful and respectful of the site owners and their resources.

  • Check robots.txt: Always check a website's robots.txt file (https://example.com/robots.txt) before crawling. Respect the rules outlined there regarding which paths are allowed or disallowed for crawling.
  • Rate Limiting: Avoid overwhelming the target server with too many requests in a short period. Implement delays between requests if necessary (SmolCrawl does not currently have built-in rate limiting).
  • Identify Yourself: Consider setting a descriptive User-Agent string to identify your crawler, although SmolCrawl does not currently support custom User-Agents.
  • Crawl During Off-Peak Hours: If possible, schedule crawls during times when the website is likely to have lower traffic.
  • Use Caching: Take advantage of SmolCrawl's caching feature to avoid re-downloading content unnecessarily.
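Since SmolCrawl does not currently rate-limit, a small helper like this (hypothetical, not part of SmolCrawl) can space out requests when you drive crawling from your own code:

```python
# Minimal politeness delay between requests; a hypothetical helper,
# since SmolCrawl itself has no built-in rate limiting.
import time

class RateLimiter:
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval   # seconds between requests
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```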

Misusing web crawlers can lead to your IP address being blocked and can negatively impact the performance and availability of the website for others. Use SmolCrawl ethically and responsibly.
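The robots.txt check described above can be done entirely with Python's standard library `urllib.robotparser`; a sketch (rules parsed inline here to keep the example offline):

```python
# Checking robots.txt rules before crawling, standard library only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In practice: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse sample rules inline so the example runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "https://example.com/docs/"))      # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```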

License

[Your License Choice]

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Download files

Download the file for your platform.

Source Distribution

smolcrawl-0.1.7.tar.gz (10.4 kB)

Uploaded Source

Built Distribution


smolcrawl-0.1.7-py3-none-any.whl (9.7 kB)

Uploaded Python 3

File details

Details for the file smolcrawl-0.1.7.tar.gz.

File metadata

  • Download URL: smolcrawl-0.1.7.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for smolcrawl-0.1.7.tar.gz:

  • SHA256: a61405a04333bb213005c1a8cbbf92c929a36e7db105761f49bd6ecffe8e3592
  • MD5: 24af8c37d7205aef01ad78c2893635d8
  • BLAKE2b-256: 75e535dcd00c95fb338909b7c5a261c8537d35854ed623e9a4a4806a2f4297b0

Provenance

The following attestation bundles were made for smolcrawl-0.1.7.tar.gz:

Publisher: pypi.yml on bllchmbrs/smolcrawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file smolcrawl-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: smolcrawl-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for smolcrawl-0.1.7-py3-none-any.whl:

  • SHA256: 8549709efd0ddd631f207204a0337caba3f2c01cb79a668ffcadd7b117c76ef1
  • MD5: 5adcf61c9ada94ca9b0d951e9cd40268
  • BLAKE2b-256: 58c92b93c956e9242f911fe10e07fea4f605ed12841792b58a454ed15a460f75

Provenance

The following attestation bundles were made for smolcrawl-0.1.7-py3-none-any.whl:

Publisher: pypi.yml on bllchmbrs/smolcrawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
