RustMapper
Concurrent web crawler and sitemap generator built with Rust. Supports up to 256 concurrent workers, a sharded frontier, persistent state with write-ahead logging (WAL), and distributed crawling via Redis.
Installation
From PyPI
pip install rustmapper
From Source
git clone https://github.com/BenjaminSRussell/Rust-sitemap.git
cd Rust-sitemap
cargo build --release
Usage
Python
from rustmapper import Crawler
crawler = Crawler(
start_url="https://example.com",
workers=128,
timeout=10
)
results = crawler.crawl()
for url_data in results:
print(f"{url_data['url']}: {url_data['status_code']}")
Command Line
# Basic crawl
rustmapper crawl --start-url https://example.com
# With options
rustmapper crawl --start-url https://example.com --workers 128 --timeout 10
# Resume interrupted crawl
rustmapper resume --data-dir ./data
# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml
Cargo
# Basic crawl
cargo run --release -- crawl --start-url https://example.com
# With options
cargo run --release -- crawl --start-url https://example.com --workers 128 --timeout 10
# Resume
cargo run --release -- resume --data-dir ./data
# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml
Options
| Flag | Default | Description |
|---|---|---|
| --start-url | required | Starting URL |
| --workers | 256 | Concurrent requests |
| --timeout | 20 | Request timeout (seconds) |
| --data-dir | ./data | Storage location |
| --seeding-strategy | all | none/sitemap/ct/commoncrawl/all |
| --ignore-robots | false | Skip robots.txt |
| --enable-redis | false | Distributed mode |
| --redis-url | - | Redis connection |
Seeding Strategies
- none: Only start URL
- sitemap: Discover from sitemap.xml
- ct: Certificate Transparency logs (finds subdomains)
- commoncrawl: Query Common Crawl index
- all: Use all methods
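To make the sitemap strategy concrete, here is a minimal Python sketch of pulling seed URLs out of a sitemap document. It is not RustMapper's implementation: the function name `extract_sitemap_urls` and the sample document are assumptions made for illustration, and only the standard sitemaps.org namespace is handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document.

    Hypothetical helper, not part of the rustmapper API.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


# Made-up sample sitemap used only to exercise the function.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

seeds = extract_sitemap_urls(SAMPLE)
print(seeds)
```

Because a sitemap index also stores its child sitemap locations in `<loc>` elements, the same function would return those child URLs, which a crawler could then fetch and parse in turn.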
Performance
Throughput: 50-200 URLs/minute, depending on page size. Crawling is network I/O bound.
Timing breakdown per URL:
- Body download: 700-900ms (70-90%)
- Network fetch: 50-550ms (10-20%)
- Everything else: <50ms (<5%)
Recommended settings:
# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap
# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url www.university.edu
# Maximum discovery
--workers 256 --timeout 10 --seeding-strategy all
Output
JSONL (automatic): ./data/sitemap.jsonl
{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}
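Since each line of the JSONL output is a standalone JSON object, it can be processed one record at a time. The sketch below counts status codes across records. The helper name `summarize_jsonl` and the second sample record are assumptions for illustration; only the field names come from the example line above.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count status codes across JSONL records, skipping blank lines.

    Hypothetical helper, not part of the rustmapper API.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["status_code"]] += 1
    return counts


sample = [
    # First record copied from the example output; second is invented.
    '{"url":"https://example.com/","depth":0,"status_code":200,'
    '"content_length":1024,"title":"Example","link_count":5}',
    '{"url":"https://example.com/missing","depth":1,"status_code":404,'
    '"content_length":0,"title":"","link_count":0}',
]

print(summarize_jsonl(sample))
```

For a real crawl, the same function should accept an open file handle, for example `summarize_jsonl(open("./data/sitemap.jsonl"))`, because iterating a text file yields one line at a time.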
XML sitemap:
rustmapper export-sitemap --data-dir ./data --output sitemap.xml
Distributed Crawling
# Instance 1
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379
# Instance 2
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379
Includes URL deduplication, work stealing, and distributed locks.
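One common way to deduplicate URLs across instances is an atomic set insert in Redis: `SADD` returns the number of members newly added, so only the first instance to add a URL sees `1`. Whether RustMapper uses this exact mechanism is an assumption. The sketch below uses an in-memory stand-in (`FakeSetStore`) so it runs without a server, and the key name `crawler:seen` is hypothetical.

```python
class FakeSetStore:
    """In-memory stand-in exposing the one Redis-style call the sketch needs."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Mirrors Redis SADD: returns how many members were newly added.
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(values)
        return len(target) - before


def claim_url(store, url, key="crawler:seen"):
    """Return True only for the first caller to claim this URL.

    The key name is a made-up example, not a RustMapper setting.
    """
    return store.sadd(key, url) == 1


store = FakeSetStore()
first = claim_url(store, "https://example.com/a")
second = claim_url(store, "https://example.com/a")
print(first, second)
```

With the redis-py package installed, a `redis.Redis` client could be passed as `store` instead, since it exposes an `sadd(name, *values)` method with the same return convention.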
Architecture
- Frontier: Sharded queues (14 shards), bloom filter deduplication, per-host politeness
- State: Embedded redb database + WAL for crash recovery
- Governor: Adaptive concurrency control based on commit latency
- Workers: Async task pool with semaphore-based backpressure
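The frontier design in the list above can be illustrated with a small Python sketch. It is a simplified model and not the Rust implementation: an exact `set` stands in for the bloom filter, and routing by a CRC32 hash of the hostname is an assumption about how URLs might be assigned to shards. The class and method names are hypothetical.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

NUM_SHARDS = 14  # shard count quoted in the Architecture list


class ShardedFrontier:
    """Toy frontier: one queue per shard, URLs routed by host hash."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [deque() for _ in range(num_shards)]
        self.seen = set()  # exact set standing in for a bloom filter

    def shard_for(self, url):
        # Assumed routing rule: hash the hostname so one host maps to one shard.
        host = urlsplit(url).hostname or ""
        return zlib.crc32(host.encode("utf-8")) % len(self.shards)

    def push(self, url):
        """Queue the URL unless it was seen before; return True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.shards[self.shard_for(url)].append(url)
        return True

    def pop(self, shard_index):
        """Take the oldest URL from one shard, or None if it is empty."""
        shard = self.shards[shard_index]
        return shard.popleft() if shard else None


frontier = ShardedFrontier()
frontier.push("https://example.com/a")
frontier.push("https://example.com/b")
frontier.push("https://example.com/a")  # duplicate, rejected
print(sum(len(shard) for shard in frontier.shards))
```

Routing by host keeps every URL for a given site in a single shard, which is one way a per-host politeness delay could be enforced without coordination between shards.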
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Slow crawling | Large pages take ~1s to download | None needed; crawling is network I/O bound, so this is expected |
| Many timeouts | Internal/unreachable hosts (CT log discovery) | Reduce timeout: --timeout 5 or use --seeding-strategy sitemap |
| Out of memory | Too many concurrent large pages | Reduce workers: --workers 64 |
| Stops unexpectedly | Crawl completed (frontier empty) | Use resume to continue |
Development
Build from source:
cargo build --release
Run tests:
cargo test
Documentation
- Performance Analysis - Detailed timing breakdown
- Bottleneck Summary - Where time is spent
License
MIT