RustMapper

Concurrent web crawler and sitemap generator built with Rust. It supports up to 256 concurrent workers and uses a sharded frontier, persistent state with Write-Ahead Logging (WAL), and optional distributed crawling with Redis.

Installation

From PyPI

pip install rustmapper

From Source

git clone https://github.com/BenjaminSRussell/Rust-sitemap.git
cd Rust-sitemap
cargo build --release

Usage

Python

from rustmapper import Crawler

crawler = Crawler(
    start_url="https://example.com",
    workers=128,
    timeout=10
)
results = crawler.crawl()

for url_data in results:
    print(f"{url_data['url']}: {url_data['status_code']}")

Command Line

# Basic crawl
rustmapper crawl --start-url https://example.com

# With options
rustmapper crawl --start-url https://example.com --workers 128 --timeout 10

# Resume interrupted crawl
rustmapper resume --data-dir ./data

# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml

Cargo

# Basic crawl
cargo run --release -- crawl --start-url https://example.com

# With options
cargo run --release -- crawl --start-url https://example.com --workers 128 --timeout 10

# Resume
cargo run --release -- resume --data-dir ./data

# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml

Options

| Flag | Default | Description |
| --- | --- | --- |
| --start-url | required | Starting URL |
| --workers | 256 | Concurrent requests |
| --timeout | 20 | Request timeout (seconds) |
| --data-dir | ./data | Storage location |
| --seeding-strategy | all | none/sitemap/ct/commoncrawl/all |
| --ignore-robots | false | Skip robots.txt |
| --enable-redis | false | Distributed mode |
| --redis-url | - | Redis connection |
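
When driving the CLI from a script, the flags above can be assembled programmatically. The following is a minimal sketch, not part of RustMapper: `build_crawl_command` is a hypothetical helper, and it assumes `--ignore-robots` is a bare switch that takes no value, as `--enable-redis` is in the distributed example.

```python
import shlex

# Strategy names taken from the --seeding-strategy row of the options table.
VALID_STRATEGIES = {"none", "sitemap", "ct", "commoncrawl", "all"}


def build_crawl_command(start_url, workers=256, timeout=20, data_dir="./data",
                        seeding_strategy="all", ignore_robots=False,
                        redis_url=None):
    """Return an argv list for `rustmapper crawl` (hypothetical helper)."""
    if seeding_strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown seeding strategy: {seeding_strategy!r}")
    argv = [
        "rustmapper", "crawl",
        "--start-url", start_url,
        "--workers", str(workers),
        "--timeout", str(timeout),
        "--data-dir", data_dir,
        "--seeding-strategy", seeding_strategy,
    ]
    if ignore_robots:
        # Assumption: bare switch with no value.
        argv.append("--ignore-robots")
    if redis_url is not None:
        argv += ["--enable-redis", "--redis-url", redis_url]
    return argv


cmd = build_crawl_command("https://example.com", workers=128, timeout=10,
                          seeding_strategy="sitemap")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` without shell quoting concerns.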

Seeding Strategies

  • none - Only start URL
  • sitemap - Discover from sitemap.xml
  • ct - Certificate Transparency logs (finds subdomains)
  • commoncrawl - Query Common Crawl index
  • all - Use all methods
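
To illustrate what sitemap seeding involves, here is a rough Python sketch that extracts URLs from a sitemap document using the standard sitemaps.org namespace. It is not RustMapper's implementation; `extract_sitemap_urls` and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text.strip())
    # <loc> appears under <url> in a sitemap and under <sitemap> in an index,
    # so iterating over all <loc> elements covers both document types.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


print(extract_sitemap_urls(SAMPLE))
```

For a sitemap index, the returned entries are themselves sitemap URLs that would need to be fetched and parsed in turn.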

Performance

Throughput: 50-200 URLs/minute depending on page size. Network I/O bound.

Timing breakdown per URL:

  • Body download: 700-900ms (70-90%)
  • Network fetch: 50-550ms (10-20%)
  • Everything else: <50ms (<5%)

Recommended settings:

# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap

# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url https://www.university.edu

# Maximum discovery
--workers 256 --timeout 10 --seeding-strategy all

Output

JSONL (automatic): ./data/sitemap.jsonl

{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}

XML sitemap:

rustmapper export-sitemap --data-dir ./data --output sitemap.xml
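
For custom filtering before export, a sitemap can also be built directly from the JSONL records. This sketch is independent of `export-sitemap` and does not claim to match its output. Keeping only records with status code 200 is an assumption made for the example, and `records_to_sitemap` is a hypothetical name.

```python
import json
import xml.etree.ElementTree as ET


def records_to_sitemap(lines):
    """Build a sitemaps.org <urlset> document from JSONL crawl records."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption for this sketch: only successful pages belong in the sitemap.
        if record.get("status_code") != 200:
            continue
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = record["url"]
    return ET.tostring(urlset, encoding="unicode")


records = [
    '{"url":"https://example.com/","status_code":200}',
    '{"url":"https://example.com/missing","status_code":404}',
]
print(records_to_sitemap(records))
```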

Distributed Crawling

# Instance 1
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

# Instance 2
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

Includes URL deduplication, work stealing, and distributed locks.

Architecture

  • Frontier: Sharded queues (14 shards), bloom filter deduplication, per-host politeness
  • State: Embedded redb database + WAL for crash recovery
  • Governor: Adaptive concurrency control based on commit latency
  • Workers: Async task pool with semaphore-based backpressure
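
The frontier design can be illustrated with a small Python model. This is a sketch of the general technique only, not the Rust implementation: a plain `set` stands in for the bloom filter, URLs are assigned to shards by hashing the host, and the `min_delay` politeness interval is an assumed parameter. All names are hypothetical.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


class ShardedFrontier:
    def __init__(self, num_shards=14, min_delay=1.0):
        self.shards = [deque() for _ in range(num_shards)]
        self.seen = set()        # exact-set stand-in for a bloom filter
        self.next_allowed = {}   # host -> earliest time of the next fetch
        self.min_delay = min_delay

    def shard_for(self, host):
        # A stable hash keeps every URL of a host in the same shard.
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def push(self, url):
        """Enqueue a URL unless it was seen before; return True if enqueued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        self.shards[self.shard_for(host)].append(url)
        return True

    def pop(self, shard_index, now):
        """Return the next URL whose host may be fetched at `now`, else None."""
        queue = self.shards[shard_index]
        for _ in range(len(queue)):
            url = queue.popleft()
            host = urlsplit(url).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.min_delay
                return url
            queue.append(url)  # host is cooling down; rotate to the back
        return None


frontier = ShardedFrontier(num_shards=4)
frontier.push("https://example.com/a")
frontier.push("https://example.com/a")  # duplicate, ignored
shard = frontier.shard_for("example.com")
print(frontier.pop(shard, now=0.0))
```

Sharding by host means each worker group contends on a smaller queue, and per-host delay bookkeeping stays local to one shard.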

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Slow crawling | Large pages take ~1s to download | Network I/O bound, expected |
| Many timeouts | Internal/unreachable hosts (CT log discovery) | Reduce timeout: --timeout 5 or use --seeding-strategy sitemap |
| Out of memory | Too many concurrent large pages | Reduce workers: --workers 64 |
| Stops unexpectedly | Crawl completed (frontier empty) | Use resume to continue |

Development

Build from source:

cargo build --release

Run tests:

cargo test

License

MIT
