Concurrent web crawler and sitemap generator built with Rust

RustMapper

RustMapper is a concurrent web crawler and sitemap generator built with Rust. It supports up to 256 concurrent workers, a sharded frontier, persistent state with write-ahead logging (WAL), and distributed crawling through Redis.

Installation

From PyPI

pip install rustmapper

From Source

git clone https://github.com/BenjaminSRussell/Rust-sitemap.git
cd Rust-sitemap
cargo build --release

Usage

Python

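from rustmapper import Crawler

# Configure the crawl: 128 concurrent workers, 10-second request timeout.
crawler = Crawler(
    start_url="https://example.com",
    workers=128,
    timeout=10,
)
results = crawler.crawl()

# Print each crawled URL with its HTTP status code.
for url_data in results:
    print(f"{url_data['url']}: {url_data['status_code']}")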

Command Line

# Basic crawl
rustmapper crawl --start-url https://example.com

# With options
rustmapper crawl --start-url https://example.com --workers 128 --timeout 10

# Resume interrupted crawl
rustmapper resume --data-dir ./data

# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml

Cargo

# Basic crawl
cargo run --release -- crawl --start-url https://example.com

# With options
cargo run --release -- crawl --start-url https://example.com --workers 128 --timeout 10

# Resume
cargo run --release -- resume --data-dir ./data

# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml

Options

Flag	Default	Description
`--start-url`	required	Starting URL
`--workers`	256	Number of concurrent requests
`--timeout`	20	Request timeout (seconds)
`--data-dir`	./data	Storage location for crawl state and output
`--seeding-strategy`	all	One of none, sitemap, ct, commoncrawl, all
`--ignore-robots`	false	Ignore robots.txt rules
`--enable-redis`	false	Enable distributed mode
`--redis-url`	none	Redis connection URL

Seeding Strategies

none - Crawl from the start URL only
sitemap - Discover URLs from sitemap.xml
ct - Query Certificate Transparency logs (finds subdomains)
commoncrawl - Query the Common Crawl index
all - Use every discovery method above
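
To make the sitemap strategy more concrete, here is a minimal Python sketch that pulls the `<loc>` entries out of a sitemap document using only the standard library. It is an illustration, not RustMapper's implementation: the function name `extract_sitemap_urls` and the sample XML are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    urls = []
    # iter() walks the whole tree, so <urlset> and <sitemapindex> both work.
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap content used only for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

A real seeder would also need to fetch the document over HTTP and follow nested sitemap index files; both steps are left out here.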

Performance

Throughput: 50-200 URLs/minute, depending on page size. The crawler is network I/O bound.

Timing breakdown per URL:

Body download: 700-900ms (70-90%)
Network fetch: 50-550ms (10-20%)
Everything else: <50ms (<5%)

Recommended settings:

# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap

# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url https://www.university.edu

# Maximum discovery
--workers 256 --timeout 10 --seeding-strategy all

Output

JSONL (written automatically): ./data/sitemap.jsonl

{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}

XML sitemap:

rustmapper export-sitemap --data-dir ./data --output sitemap.xml
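
For readers who prefer to post-process the JSONL records themselves, the following sketch renders a sitemaps.org `<urlset>` from a list of records. It is not a description of what `export-sitemap` does internally. Keeping only status-200 URLs and dropping duplicates are assumptions of this example, as is the function name `build_sitemap`.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records: list[dict]) -> str:
    """Render a <urlset> containing each distinct URL that returned 200."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    seen = set()
    for record in records:
        url = record["url"]
        # Assumption: only successful, not-yet-listed pages belong in the sitemap.
        if record.get("status_code") != 200 or url in seen:
            continue
        seen.add(url)
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    demo = [
        {"url": "https://example.com/", "status_code": 200},
        {"url": "https://example.com/missing", "status_code": 404},
        {"url": "https://example.com/", "status_code": 200},
    ]
    print(build_sitemap(demo))
```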

Distributed Crawling

# Instance 1
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

# Instance 2
rustmapper crawl --start-url https://example.com --enable-redis --redis-url redis://localhost:6379

Point every instance at the same Redis URL. Distributed mode includes URL deduplication, work stealing, and distributed locks.

Architecture

Frontier: Sharded queues (14 shards), bloom filter deduplication, per-host politeness
State: Embedded redb database + WAL for crash recovery
Governor: Adaptive concurrency control based on commit latency
Workers: Async task pool with semaphore-based backpressure
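
The frontier design can be sketched in a few lines of Python. This is a toy model under stated assumptions, not RustMapper's code: sharding by a hash of the host name, the one-second politeness delay, the class name `ShardedFrontier`, and the use of a plain `set` in place of a bloom filter are all choices made for this example. Only the shard count of 14 comes from the list above.

```python
import hashlib
from collections import deque
from typing import Optional
from urllib.parse import urlsplit

NUM_SHARDS = 14  # shard count quoted in the README


class ShardedFrontier:
    """Toy frontier: one FIFO queue per shard, chosen by hashing the host."""

    def __init__(self, num_shards: int = NUM_SHARDS, delay: float = 1.0):
        self.shards = [deque() for _ in range(num_shards)]
        self.seen = set()        # exact set standing in for the bloom filter
        self.next_allowed = {}   # host -> earliest time of the next fetch
        self.delay = delay       # hypothetical per-host delay in seconds

    def shard_for(self, url: str) -> int:
        host = urlsplit(url).hostname or ""
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def push(self, url: str) -> bool:
        """Queue a URL once; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.shards[self.shard_for(url)].append(url)
        return True

    def pop(self, shard: int, now: float) -> Optional[str]:
        """Return the next URL in the shard whose host may be fetched at `now`."""
        queue = self.shards[shard]
        for _ in range(len(queue)):
            url = queue.popleft()
            host = urlsplit(url).hostname or ""
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return url
            queue.append(url)  # host is cooling down; rotate to the back
        return None
```

Hashing by host keeps all URLs for one site in the same shard, which makes per-host politeness a shard-local decision. A bloom filter would replace the `set` to cut memory use, at the cost of a small false-positive rate that causes some URLs to be skipped.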

Troubleshooting

Issue	Cause	Solution
Slow crawling	Large pages take ~1s to download	Expected; the crawl is network I/O bound
Many timeouts	Internal/unreachable hosts (CT log discovery)	Reduce timeout: `--timeout 5` or use `--seeding-strategy sitemap`
Out of memory	Too many concurrent large pages	Reduce workers: `--workers 64`
Stops unexpectedly	Crawl completed (frontier empty)	Use `resume` to continue

Development

Build from source:

cargo build --release

Run tests:

cargo test

Documentation

Performance Analysis - Detailed timing breakdown
Bottleneck Summary - Where time is spent

License

MIT

Release history

0.1.3 - Nov 8, 2025
0.1.2 - Nov 8, 2025
0.1.1 - Nov 8, 2025
0.1.0 - Nov 8, 2025 (this version)

Download files

Source Distribution

rustmapper-0.1.0.tar.gz (104.7 kB)

Uploaded Nov 8, 2025 Source

Built Distribution

rustmapper-0.1.0-cp313-cp313-macosx_11_0_arm64.whl (155.0 kB)

Uploaded Nov 8, 2025 CPython 3.13, macOS 11.0+ ARM64

File details

Details for the file rustmapper-0.1.0.tar.gz.

File metadata

Download URL: rustmapper-0.1.0.tar.gz
Upload date: Nov 8, 2025
Size: 104.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for rustmapper-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`279c201c023741aea6355c60485eab20fdc4399438f1da908c0f2212a23c8cc4`
MD5	`32e356fa2ed55c89dd08a80e3c9bd44b`
BLAKE2b-256	`7409d7f73ff28c92e66801e5a2d78fd298b8b97febc2844b514acc36dcd6a9f6`

File details

Details for the file rustmapper-0.1.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

Download URL: rustmapper-0.1.0-cp313-cp313-macosx_11_0_arm64.whl
Upload date: Nov 8, 2025
Size: 155.0 kB
Tags: CPython 3.13, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for rustmapper-0.1.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`b3c71c309bc3da586e48c0662c0b17a9311745cebfb134d2205e488276358577`
MD5	`01e2f81e0a462fa99c5df7e736444261`
BLAKE2b-256	`67d4d40e4b4b07aa341cfedf1dc56aa528f0ca87787a4f265a13118cdcb9fbef`
