Crawls and indexes websites
SmolCrawl
A lightweight web crawler and indexer for creating searchable document collections from websites.
Overview
SmolCrawl is a Python-based tool that helps you:
- Crawl websites and extract content
- Convert HTML content to readable markdown
- Index pages for efficient searching
- Query indexed content with relevance scoring
It is well suited to building local knowledge bases, documentation search, or personal research collections.
Features
- Simple Web Crawling: Easily crawl and extract content from target websites
- Content Extraction: Automatically extracts meaningful content from HTML using readability algorithms
- Markdown Conversion: Converts HTML content to clean, readable markdown format
- Fast Indexing: Uses Tantivy (Rust-based search library) for performant full-text search
- Caching: Implements disk-based caching to avoid redundant crawling
- CLI Interface: Simple command-line interface for all operations
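To make the caching feature concrete, here is a minimal sketch of disk-based caching in general, not SmolCrawl's actual implementation. The names `DiskCache` and `fetch_cached`, the SHA-256 file naming, and the `.html` suffix are all assumptions made for illustration; the `fetch` argument stands in for whatever function downloads a page.

```python
import hashlib
from pathlib import Path


class DiskCache:
    """Hypothetical cache: one file per URL, named by the URL's SHA-256 digest."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.html"

    def get(self, url):
        """Return the cached body, or None if the URL was never stored."""
        path = self._path(url)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def put(self, url, body):
        self._path(url).write_text(body, encoding="utf-8")


def fetch_cached(url, cache, fetch):
    """Call fetch(url) only when the cache has no entry for that URL."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    body = fetch(url)
    cache.put(url, body)
    return body
```

With a cache like this, a second crawl of the same URL reads from disk instead of hitting the network.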
Installation
# Clone the repository
git clone https://github.com/bllchmbrs/smolcrawl.git
cd smolcrawl
# Install the package
pip install -e .
Requirements
- Python 3.11 or higher
- Dependencies are automatically installed with the package
Usage
Crawl a Website
smolcrawl crawl https://example.com
Index a Website
smolcrawl index https://example.com my_index_name
List Available Indices
smolcrawl list_indices
Query an Index
smolcrawl query my_index_name "your search query" --limit 10 --score_threshold 0.5
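One plausible reading of `--limit` and `--score_threshold` is "drop hits scoring below the threshold, then return at most `limit` of the rest". The sketch below illustrates that reading only; the `Hit` type, the `select_hits` function, and the inclusive `>=` comparison are assumptions, not SmolCrawl's code.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit: a page URL and its relevance score."""
    url: str
    score: float


def select_hits(hits, limit=10, score_threshold=0.5):
    """Rank hits by score (highest first), drop weak ones, cap the count."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    kept = [h for h in ranked if h.score >= score_threshold]
    return kept[:limit]
```

Because hits are sorted by descending score, applying the threshold before or after the limit gives the same result.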
Delete an Index
smolcrawl delete_index my_index_name
Configuration
SmolCrawl uses environment variables for configuration:
- STORAGE_PATH: Path to store data (default: ./data)
- CACHE_PATH: Path for caching (default: ./data/cache)
You can set these in a .env file in the project root.
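As an illustration of how these two variables and their defaults might be resolved, here is a small sketch using only the standard library. The function name `load_paths` is hypothetical, and loading the .env file itself is not shown; the sketch assumes the values are already in the process environment.

```python
import os
from pathlib import Path


def load_paths(env=None):
    """Resolve storage and cache directories, falling back to the documented defaults."""
    env = os.environ if env is None else env
    storage = Path(env.get("STORAGE_PATH", "./data"))
    cache = Path(env.get("CACHE_PATH", "./data/cache"))
    return storage, cache


storage_path, cache_path = load_paths()
print(f"storage: {storage_path}")
print(f"cache:   {cache_path}")
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment.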
Project Structure
smolcrawl/
├── src/smolcrawl/
│ ├── __init__.py # CLI and entry points
│ ├── crawl.py # Web crawling functionality
│ ├── db.py # Indexing and search functionality
│ └── utils.py # Utility functions
├── data/ # Storage for indices and cache (gitignored)
├── .gitignore
└── pyproject.toml # Project metadata and dependencies
How It Works
- Crawling: Uses BeautifulSoupCrawler to fetch web pages and extract links
- Content Processing: Extracts meaningful content using ReadabiliPy and converts to markdown
- Indexing: Stores extracted content in a Tantivy index for efficient searching
- Searching: Performs full-text search on indexed content with relevance ranking
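The four steps above can be sketched end to end with the standard library alone. This is a toy stand-in, not SmolCrawl's code: `html.parser` replaces BeautifulSoupCrawler and ReadabiliPy, a dictionary of term counts replaces the Tantivy index, markdown conversion is skipped, and pages come from an in-memory dictionary rather than the network. The names `PageParser`, `TinyIndex`, and `crawl_and_index` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect visible text and absolute outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy full-text index: the score is the summed count of query terms."""

    def __init__(self):
        self.docs = {}  # url -> Counter of terms

    def add(self, url, text):
        self.docs[url] = Counter(tokenize(text))

    def search(self, query, limit=10):
        terms = tokenize(query)
        scored = []
        for url, counts in self.docs.items():
            score = sum(counts[t] for t in terms)
            if score > 0:
                scored.append((url, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


def crawl_and_index(start_url, pages, index):
    """Breadth-first crawl over `pages` (url -> html), indexing each page once."""
    queue, seen = [start_url], set()
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue
        seen.add(url)
        parser = PageParser(url)
        parser.feed(pages[url])
        index.add(url, " ".join(parser.text_parts))
        queue.extend(parser.links)
    return seen
```

A real index such as Tantivy ranks with a proper relevance model rather than raw term counts, so treat the scores here as illustrative only.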
License
[Your License Choice]
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request