Concurrent web crawler and sitemap generator built with Rust

RustMapper

Concurrent web crawler and sitemap generator built with Rust. Supports up to 256 concurrent workers, sharded frontier architecture, persistent state with Write-Ahead Logging, and distributed crawling with Redis.

Installation

From PyPI

pip install rustmapper

From Source

git clone https://github.com/BenjaminSRussell/Rust-sitemap.git
cd Rust-sitemap
cargo build --release

Usage

Python

from rustmapper import Crawler

crawler = Crawler(
    start_url="https://example.com",
    workers=128,
    timeout=10
)
results = crawler.crawl()

for url_data in results:
    print(f"{url_data['url']}: {url_data['status_code']}")
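
As a follow-up to the loop above, here is a minimal sketch of summarizing crawl results by status code. It assumes each record is a dict with the 'url' and 'status_code' keys used above and that 'status_code' is an integer; the sample list and the summarize helper are hypothetical and stand in for the value returned by crawler.crawl().

```python
from collections import Counter

# Hypothetical sample shaped like the records printed above; a real run
# would pass the list returned by crawler.crawl() instead.
results = [
    {"url": "https://example.com/", "status_code": 200},
    {"url": "https://example.com/about", "status_code": 200},
    {"url": "https://example.com/missing", "status_code": 404},
]


def summarize(results):
    """Count pages per status code and collect URLs that returned >= 400."""
    counts = Counter(r["status_code"] for r in results)
    failed = [r["url"] for r in results if r["status_code"] >= 400]
    return counts, failed


counts, failed = summarize(results)
print(dict(counts))
print(failed)
```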

Command Line

# Basic crawl
rustmapper crawl --start-url https://example.com

# With options
rustmapper crawl --start-url https://example.com --workers 128 --timeout 10

# Resume interrupted crawl
rustmapper resume --data-dir ./data

# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml

Cargo

# Basic crawl
cargo run --release -- crawl --start-url https://example.com

# With options
cargo run --release -- crawl --start-url https://example.com --workers 128 --timeout 10

# Resume
cargo run --release -- resume --data-dir ./data

# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml

Options

Flag                Default    Description
--start-url         required   Starting URL
--workers           256        Concurrent requests
--timeout           20         Request timeout (seconds)
--data-dir          ./data     Storage location
--seeding-strategy  all        none/sitemap/ct/commoncrawl/all
--ignore-robots     false      Skip robots.txt
--enable-redis      false      Distributed mode
--redis-url         -          Redis connection
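
When the crawler is launched from a script, the flags above can be assembled programmatically. The sketch below is illustrative only: build_crawl_command is a hypothetical helper, not part of RustMapper, and it assumes that --ignore-robots and --enable-redis are value-less switches and that --workers must lie between 1 and 256.

```python
import shlex

# Values listed for --seeding-strategy in the options table.
SEEDING_STRATEGIES = {"none", "sitemap", "ct", "commoncrawl", "all"}


def build_crawl_command(start_url, workers=256, timeout=20, data_dir="./data",
                        seeding_strategy="all", ignore_robots=False,
                        redis_url=None):
    """Return an argv list for `rustmapper crawl` (hypothetical helper)."""
    if seeding_strategy not in SEEDING_STRATEGIES:
        raise ValueError(f"unknown seeding strategy: {seeding_strategy!r}")
    if not 1 <= workers <= 256:
        # Assumption: the documented maximum of 256 workers is a hard limit.
        raise ValueError("workers must be between 1 and 256")
    cmd = [
        "rustmapper", "crawl",
        "--start-url", start_url,
        "--workers", str(workers),
        "--timeout", str(timeout),
        "--data-dir", data_dir,
        "--seeding-strategy", seeding_strategy,
    ]
    if ignore_robots:
        cmd.append("--ignore-robots")  # assumed to be a switch without a value
    if redis_url:
        cmd += ["--enable-redis", "--redis-url", redis_url]
    return cmd


cmd = build_crawl_command("https://example.com", workers=128, timeout=10,
                          seeding_strategy="sitemap")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once rustmapper is on the PATH.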

Seeding Strategies

  • none - Only start URL
  • sitemap - Discover from sitemap.xml
  • ct - Certificate Transparency logs (finds subdomains)
  • commoncrawl - Query Common Crawl index
  • all - Use all methods
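
To illustrate what the sitemap strategy involves, here is a minimal sketch that extracts URLs from a sitemap document in the standard sitemaps.org format. It is not RustMapper's implementation; urls_from_sitemap and the sample document are hypothetical, and a real seeder would first fetch the sitemap and follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hypothetical sitemap content used only for this demonstration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

seed_urls = urls_from_sitemap(sample)
print(seed_urls)
```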

Performance

Throughput: 50-200 URLs/minute, depending on page size. Crawling is network I/O bound.

Timing breakdown per URL:

  • Body download: 700-900ms (70-90%)
  • Network fetch: 50-550ms (10-20%)
  • Everything else: <50ms (<5%)

Recommended settings:

# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap

# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url https://www.university.edu

# Maximum discovery
--workers 256 --timeout 10 --seeding-strategy all

Output

JSONL (written automatically to ./data/sitemap.jsonl, one JSON object per crawled URL):

{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}
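
The JSONL file can be read line by line with the standard library. The sketch below assumes the field names from the sample record above; read_records is a hypothetical helper, and the temporary file stands in for ./data/sitemap.jsonl so the example runs without an existing crawl.

```python
import json
import tempfile
from pathlib import Path


def read_records(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo data: the sample record shown above, written to a temporary file.
# A real run would call read_records("./data/sitemap.jsonl") instead.
sample = (
    '{"url":"https://example.com/","depth":0,"status_code":200,'
    '"content_length":1024,"title":"Example","link_count":5}\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sitemap.jsonl"
    path.write_text(sample, encoding="utf-8")
    records = read_records(path)

for rec in records:
    print(rec["depth"], rec["status_code"], rec["url"], rec["title"])
```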

XML sitemap:

rustmapper export-sitemap --data-dir ./data --output sitemap.xml
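
For readers who want to post-process crawl records themselves, here is a rough sketch of building a sitemaps.org-style XML document from them. The exact output of export-sitemap is not shown here, so this is not a description of it; build_sitemap is a hypothetical helper, and including only pages with status 200 is an assumption.

```python
import xml.etree.ElementTree as ET


def build_sitemap(records):
    """Build a minimal sitemap XML string from crawl records."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for rec in records:
        # Assumption: only successfully fetched pages belong in the sitemap.
        if rec.get("status_code") == 200:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = rec["url"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical records shaped like the JSONL output.
records = [
    {"url": "https://example.com/", "status_code": 200},
    {"url": "https://example.com/missing", "status_code": 404},
]
sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```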

Distributed Crawling

# Instance 1
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

# Instance 2
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

Both instances connect to the same Redis server. Distributed mode includes URL deduplication, work stealing, and distributed locks.

Architecture

  • Frontier: Sharded queues (14 shards), bloom filter deduplication, per-host politeness
  • State: Embedded redb database + WAL for crash recovery
  • Governor: Adaptive concurrency control based on commit latency
  • Workers: Async task pool with semaphore-based backpressure
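
The frontier design above can be sketched in a few lines. This is a simplified illustration, not RustMapper's Rust implementation: the BloomFilter and ShardedFrontier classes, the SHA-256 based hashing, the filter size, and the choice to shard by host (so that per-host politeness can be handled within one shard) are all assumptions. Only the shard count of 14 comes from the description above.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_SHARDS = 14  # shard count stated in the architecture list


def _hash(data, seed):
    """Deterministic 64-bit hash derived from SHA-256 (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Fixed-size bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        return [_hash(item, seed) % self.size for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class ShardedFrontier:
    """URL queues split across shards, with bloom-filter deduplication."""

    def __init__(self):
        self.shards = [deque() for _ in range(NUM_SHARDS)]
        self.seen = BloomFilter()

    def shard_for(self, url):
        # Assumption: all URLs of one host map to the same shard.
        host = urlsplit(url).hostname or ""
        return _hash(host, 0) % NUM_SHARDS

    def push(self, url):
        """Queue a URL unless it was (probably) seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.shards[self.shard_for(url)].append(url)
        return True


frontier = ShardedFrontier()
print(frontier.push("https://example.com/a"))  # first time: queued
print(frontier.push("https://example.com/a"))  # duplicate: rejected
print(frontier.push("https://example.com/b"))  # same host, same shard
```

A bloom filter trades a small false-positive rate (a new URL occasionally treated as seen) for constant memory, which is why it suits large crawls.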

Troubleshooting

Issue               Cause                                          Solution
Slow crawling       Large pages take ~1s to download               Expected; crawling is network I/O bound
Many timeouts       Internal/unreachable hosts (CT log discovery)  Reduce the timeout (--timeout 5) or use --seeding-strategy sitemap
Out of memory       Too many concurrent large pages                Reduce workers (--workers 64)
Stops unexpectedly  Crawl completed (frontier empty)               Run rustmapper resume --data-dir ./data to continue

Development

Build from source:

cargo build --release

Run tests:

cargo test

License

MIT
