Async web crawler with rate limiting, robots.txt support, and broken link tracking

Reason this release was yanked:

renamed internals

Project description

WebCrawler

Lightweight async web crawler for link analysis and HTML document processing.

Perfect for: Site structure analysis, link tracking, concurrent page fetching, HTML document transformation.

Not: A replacement for Scrapy. Use this when you need simple, focused crawling with automatic link classification and clean document models.

Key Features

⚡ Async/await native — Built on asyncio + aiohttp for concurrent requests
🔗 Automatic link classification — Distinguishes internal vs external links by domain
📄 Rich document model — Full HTML source, parsed links, metadata, headers
🔄 Persistent sessions — Connection pooling for 10-100x faster same-domain crawls
🔁 Retries + backoff — Exponential backoff for transient errors (timeouts, 5xx)
⏱️ Rate limiting — Per-domain rate limiting with asyncio.Lock, no thundering herd
🤖 robots.txt support — Automatically respect Crawl-delay directives per domain
🔍 Broken link tracking — Audit 404s and 5xx errors for site structure validation
💾 Optional caching — Disk-based cache (1-day TTL) for repeat crawls
🔐 SSL verification — Secure by default, with corporate proxy support
🍪 Automatic cookies — Set-Cookie extraction and sending built-in
🔀 Traversal strategies — BFS (broad) or DFS (deep) crawling
📊 Multi-format export — JSON, Pandas, Polars, PyArrow for data analysis
📍 Callbacks & streaming — Process results as crawled without memory buildup

Quick Start

import asyncio
from WebCrawler import Spider

async def main():
    spider = Spider(start_url="https://example.com", max_depth=2)
    documents = await spider.run_async()
    
    for doc in documents:
        print(f"{doc.url}")
        print(f"  Internal links: {len(doc.internal_links)}")
        print(f"  External links: {len(doc.external_links)}")

asyncio.run(main())

Installation

pip install webcrawler

Optional export formats:

pip install webcrawler[serializers]  # pandas + polars + pyarrow
pip install webcrawler[pandas]       # Just pandas

Core Concepts

Spider

High-level orchestrator that crawls multiple pages using BFS (breadth-first) or DFS (depth-first) traversal.

Crawler

Low-level engine that fetches and parses individual documents. Handles retries, caching, SSL, cookies, sessions.

Document

Rich object containing:

url — page URL
title — HTML title tag
source — raw HTML
internal_links — links to same domain
external_links — links to other domains
status_code, response_headers, domain — metadata

See Core Concepts for more.
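To make the internal/external split concrete, here is a minimal sketch of how links could be classified by comparing hostnames. It uses only the standard library; `PageLinks` and `classify_links` are hypothetical names for illustration and are not part of WebCrawler's API, and the library's actual matching rules (subdomains, ports) may differ.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin, urlparse


@dataclass
class PageLinks:
    """Hypothetical stand-in for the link-related fields of a Document."""
    url: str
    internal_links: list[str] = field(default_factory=list)
    external_links: list[str] = field(default_factory=list)


def classify_links(page_url: str, hrefs: list[str]) -> PageLinks:
    """Split hrefs into internal and external by comparing hostnames."""
    page_host = urlparse(page_url).hostname
    result = PageLinks(url=page_url)
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.hostname == page_host:
            result.internal_links.append(absolute)
        else:
            result.external_links.append(absolute)
    return result


page = classify_links(
    "https://example.com/docs/",
    ["intro.html", "/about", "https://other.org/x", "mailto:a@example.com"],
)
print(page.internal_links)
print(page.external_links)
```

In this sketch an exact hostname match counts as internal, so `www.example.com` and `example.com` would be treated as different sites.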

Configuration

Basic Crawl

spider = Spider(
    start_url="https://example.com",
    max_depth=3,              # How deep to follow links
    traversal_strategy="bfs"  # "bfs" (default) or "dfs"
)
documents = await spider.run_async()

Retries & Timeouts

spider = Spider(
    start_url="https://example.com",
    request_timeout=15,       # Seconds per request (default: 30)
    max_retries=5,            # Retry transient errors (default: 3)
)
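The retry behaviour can be pictured with a small exponential backoff loop. This is a sketch under stated assumptions, not the library's implementation: `backoff_delays` and `fetch_with_retries` are hypothetical helpers, and the base delay, cap, and jitter are illustrative values that the source does not specify.

```python
import asyncio
import random


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


async def fetch_with_retries(fetch, max_retries: int = 3, base: float = 0.5):
    """Await `fetch()`, retrying on timeouts and connection errors."""
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # out of retries, surface the last error
            # Add up to 10% random jitter so retries do not line up
            jitter = random.uniform(0, delays[attempt] / 10)
            await asyncio.sleep(delays[attempt] + jitter)


calls = {"n": 0}


async def flaky():
    """Fails twice with a timeout, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise asyncio.TimeoutError
    return "ok"


result = asyncio.run(fetch_with_retries(flaky, max_retries=3, base=0.01))
print(result, "after", calls["n"], "attempts")
```

With `max_retries=5` and a 0.5 s base, the waits in this sketch would grow as 0.5, 1, 2, 4, 8 seconds before the request is abandoned.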

Caching

spider = Spider(
    start_url="https://example.com",
    cache_dir=".webcrawler_cache"  # Enable disk caching (default: None/disabled)
)
# A second run over the same URLs is served from the disk cache (see Performance for typical timings)
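A disk cache with a time-to-live can be sketched in a few lines. The class below is an illustration only: `DiskCache`, its file layout, and the JSON entry format are assumptions, and the only detail taken from this README is the 1-day TTL listed under Key Features.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

ONE_DAY = 24 * 60 * 60  # seconds, matching the 1-day TTL listed in Key Features


class DiskCache:
    """Hypothetical URL -> body cache with one JSON file per URL."""

    def __init__(self, cache_dir: str, ttl: float = ONE_DAY):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe file name
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.dir / f"{digest}.json"

    def get(self, url: str):
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] > self.ttl:
            return None  # entry is older than the TTL
        return entry["body"]

    def put(self, url: str, body: str) -> None:
        entry = {"stored_at": time.time(), "body": body}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")


cache = DiskCache(tempfile.mkdtemp())
cache.put("https://example.com", "<html></html>")
hit = cache.get("https://example.com")
miss = cache.get("https://example.com/other")
# A negative TTL makes every stored entry count as expired
expired = DiskCache(str(cache.dir), ttl=-1).get("https://example.com")
```

A cached run skips the network entirely for URLs that are still within the TTL, which is where the speedup on repeat crawls comes from.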

SSL & Corporate Proxies

# Default: verify SSL with system CA
spider = Spider(start_url="https://example.com")

# Corporate proxy with custom CA bundle
spider = Spider(
    start_url="https://example.com",
    ssl_verify="/path/to/corporate-ca.pem"
)

# Self-signed certs (testing only)
spider = Spider(
    start_url="https://example.com",
    ssl_verify=False  # ⚠️ Insecure
)
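The three `ssl_verify` forms above map naturally onto what aiohttp accepts for its `ssl` argument (an `ssl.SSLContext` or `False`). The helper below is a sketch of one plausible mapping; `build_ssl_argument` is a hypothetical name and the library may translate the option differently.

```python
import ssl
from typing import Union


def build_ssl_argument(ssl_verify: Union[bool, str] = True):
    """Translate an ssl_verify setting into a value usable as aiohttp's `ssl` argument."""
    if ssl_verify is False:
        return False  # disables certificate verification (insecure)
    if ssl_verify is True:
        return ssl.create_default_context()  # system CA store
    # Otherwise treat the value as a path to a CA bundle in PEM format
    return ssl.create_default_context(cafile=ssl_verify)


default_ctx = build_ssl_argument(True)
insecure = build_ssl_argument(False)
print(type(default_ctx).__name__, insecure)
```

Passing a CA bundle path keeps verification on while trusting the corporate proxy's certificate, which is generally preferable to `ssl_verify=False`.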

Cookies are handled automatically — no configuration needed.

Callbacks: Process Results in Real-Time

For large crawls, avoid memory buildup by processing documents as they're crawled:

import json

# Stream results to disk
async def save_result(doc):
    with open("results.jsonl", "a") as f:
        f.write(json.dumps({"url": doc.url, "title": doc.title}) + "\n")

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=save_result,
    accumulate_results=False,  # Don't keep in memory
)
await spider.run_async()  # Returns [], file has results

Callback Hooks:

on_page_crawled(doc): Called after each successfully crawled page. Its return value is accumulated when accumulate_results=True.
on_error(url, exc): Called when crawling a URL fails.
on_crawl_complete(): Called once when the crawl finishes (cleanup hook).

Async Callbacks Supported:

async def save_to_db(doc):
    await db.insert(doc.url, doc.title)
    return doc.url

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=save_to_db,       # Async callback
    accumulate_results=True,
)
results = await spider.run_async()  # Returns list of URLs

Return Logic:

No callback → returns all documents (default)
Callback + accumulate_results=False → returns [] (streaming mode)
Callback + accumulate_results=True → returns callback results
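The return logic above can be written out as a small function. This is a sketch of the three documented cases, not the library's code: `collect` is a hypothetical helper, plain strings stand in for documents, and the behaviour when no callback is given but `accumulate_results=False` is not covered by the source (here the documents are returned anyway).

```python
import asyncio
import inspect


async def collect(documents, on_page_crawled=None, accumulate_results=True):
    """Apply the documented return logic to an iterable of crawled documents."""
    results = []
    for doc in documents:
        if on_page_crawled is None:
            results.append(doc)  # no callback: return the documents themselves
            continue
        value = on_page_crawled(doc)
        if inspect.isawaitable(value):
            value = await value  # supports async callbacks
        if accumulate_results:
            results.append(value)  # keep the callback's return value
    return results


async def upper(doc):
    return doc.upper()


docs = ["a", "b"]
no_callback = asyncio.run(collect(docs))
streamed = asyncio.run(collect(docs, upper, accumulate_results=False))
accumulated = asyncio.run(collect(docs, upper, accumulate_results=True))
sync_callback = asyncio.run(collect(docs, len))
```

Checking `inspect.isawaitable` on the callback's return value is one common way to accept both sync and async callbacks through the same parameter.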

Traversal Strategies

BFS (Breadth-First) — Default

# Explores level by level: all depth-1 links, then depth-2, etc.
spider = Spider(start_url="https://example.com", max_depth=3, traversal_strategy="bfs")

DFS (Depth-First)

# Follows single paths all the way down before exploring siblings
spider = Spider(start_url="https://example.com", max_depth=5, traversal_strategy="dfs")

Use DFS for deep hierarchies (documentation sites, nested directories). Use BFS for broad exploration.
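The difference between the two strategies comes down to which end of the frontier the next URL is taken from. The sketch below runs both over a toy in-memory site; `SITE` and `crawl_order` are made up for illustration, and treating the start page as depth 0 is an assumption about how `max_depth` is counted.

```python
from collections import deque

# Toy link graph standing in for a website (hypothetical data)
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def crawl_order(start, max_depth, strategy="bfs"):
    """Return the order in which pages would be visited."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest entry (queue), DFS takes the newest (stack)
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links past max_depth
        links = SITE.get(url, [])
        # For DFS, push in reverse so the first link on the page is explored first
        for link in (links if strategy == "bfs" else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


bfs = crawl_order("/", max_depth=2, strategy="bfs")
dfs = crawl_order("/", max_depth=2, strategy="dfs")
print(bfs)  # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(dfs)  # ['/', '/a', '/a/1', '/a/2', '/b', '/b/1']
```

Both strategies visit the same pages when the crawl runs to completion; only the order differs, which matters mainly when a crawl is interrupted or results are streamed.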

Rate Limiting & robots.txt

By default, WebCrawler automatically respects robots.txt Crawl-delay directives and enforces per-domain rate limiting:

# Automatic robots.txt respect (default)
spider = Spider(
    start_url="https://example.com",
    user_agent="MyBot/1.0",  # Identifies your bot to robots.txt rules
)
await spider.run_async()
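To see what a Crawl-delay lookup involves, the standard library's `urllib.robotparser` can parse a robots.txt body and report the delay that applies to a given user agent. The robots.txt content below is invented for the example, and whether WebCrawler uses this module internally is not stated in this README.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (made up for this sketch)
ROBOTS_TXT = """\
User-agent: MyBot
Crawl-delay: 2
Disallow: /private/

User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The product token before "/" is matched against User-agent lines
my_delay = parser.crawl_delay("MyBot/1.0")
default_delay = parser.crawl_delay("SomeOtherBot")
allowed = parser.can_fetch("MyBot/1.0", "https://example.com/index.html")
blocked = parser.can_fetch("MyBot/1.0", "https://example.com/private/x")

print(my_delay, default_delay, allowed, blocked)
```

This is why the `user_agent` setting matters: a site can publish different delays and rules for different bots.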

Customize rate limiting:

# Enforce explicit delay (ignores robots.txt)
spider = Spider(
    start_url="https://example.com",
    request_delay=1.0,           # 1 second between requests to same domain
    respect_robots_txt=False,    # Don't fetch robots.txt
)

# Concurrent requests to different domains, serialized to same domain
await spider.run_async()
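Per-domain rate limiting with `asyncio.Lock` can be sketched as one lock and one timestamp per domain: requests to the same domain queue up behind the lock, while other domains proceed independently. `DomainRateLimiter` is a hypothetical class written for illustration and is not WebCrawler's internal implementation.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, request_delay: float):
        self.request_delay = request_delay
        self._locks = defaultdict(asyncio.Lock)  # one lock per domain
        self._last_request = {}  # domain -> monotonic time of last request

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        # Holding the lock while sleeping serializes same-domain callers
        async with self._locks[domain]:
            last = self._last_request.get(domain)
            if last is not None:
                remaining = self.request_delay - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_request[domain] = time.monotonic()


async def demo():
    limiter = DomainRateLimiter(request_delay=0.05)
    urls = [
        "https://a.example/1",
        "https://a.example/2",
        "https://a.example/3",
        "https://b.example/1",
    ]
    start = time.monotonic()
    await asyncio.gather(*(limiter.wait(u) for u in urls))
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"waited {elapsed:.2f}s for 3 same-domain requests plus 1 other")
```

Because each waiter sleeps while holding its domain's lock, waiters are released one at a time rather than all at once when the delay expires.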

Broken Link Audit

Track 404s and 5xx errors for site maintenance:

spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()

for doc in documents:
    # Broken internal links (fix these first!)
    for broken in doc.broken_internal_links:
        print(f"{doc.url} → {broken.url} (HTTP {broken.status_code})")
    
    # Broken external links (check if still valid)
    for broken in doc.broken_external_links:
        print(f"External: {broken.url} (HTTP {broken.status_code})")

Stream broken links in real-time:

async def audit_broken(doc):
    broken_count = len(doc.broken_internal_links) + len(doc.broken_external_links)
    if broken_count > 0:
        print(f"{doc.url}: {broken_count} broken links")

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=audit_broken,
    accumulate_results=False,
)
await spider.run_async()
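The audit treats 404s and 5xx responses as broken. A minimal sketch of that check and a per-category tally follows; `is_broken` and `summarize` are hypothetical helpers, and whether the library also counts other 4xx codes as broken is not stated here.

```python
from collections import Counter


def is_broken(status_code: int) -> bool:
    """True for 404 and any 5xx status, the cases this README names."""
    return status_code == 404 or 500 <= status_code <= 599


def summarize(statuses: dict[str, int]) -> Counter:
    """Count broken links per category from a url -> status mapping."""
    summary = Counter()
    for url, code in statuses.items():
        if is_broken(code):
            summary["5xx" if code >= 500 else "404"] += 1
    return summary


# Example statuses (made up for this sketch)
summary = summarize({
    "https://example.com/ok": 200,
    "https://example.com/gone": 404,
    "https://example.com/moved": 301,
    "https://example.com/error": 503,
    "https://example.com/crash": 500,
})
print(dict(summary))
```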

Export Data

from WebCrawler import Spider, Serializers

spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()

# Export to JSON
serializer = Serializers(documents)
serializer.to_json("crawl.json", include_html=False)

# Export to Pandas (one row per link)
df = serializer.to_pandas()
print(df[["url", "title", "link_url", "link_type"]])

# Export to Polars (faster for large datasets)
df_polars = serializer.to_polars()

# Export to PyArrow (for data pipelines)
table = serializer.to_arrow()
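The "one row per link" layout can be illustrated without any dataframe library. In the sketch below, plain dicts stand in for Document objects, and the `link_type` values "internal" and "external" are assumptions; only the column names come from the example above.

```python
def link_rows(documents):
    """Flatten documents into one row per link."""
    rows = []
    for doc in documents:
        for link_type in ("internal", "external"):
            for link_url in doc[f"{link_type}_links"]:
                rows.append({
                    "url": doc["url"],
                    "title": doc["title"],
                    "link_url": link_url,
                    "link_type": link_type,
                })
    return rows


# Dict stand-ins for crawled documents (hypothetical data)
docs = [
    {
        "url": "https://example.com/",
        "title": "Home",
        "internal_links": ["https://example.com/about"],
        "external_links": ["https://other.org/", "https://third.net/"],
    },
]

rows = link_rows(docs)
for row in rows:
    print(row["url"], "->", row["link_url"], f"({row['link_type']})")
```

A list of dicts in this shape can be passed directly to `pandas.DataFrame(rows)` for further analysis.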

Link Analysis

from collections import Counter
from urllib.parse import urlparse

spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()

# Count external domains
external_domains = Counter()
for doc in documents:
    for link in doc.external_links:
        # urlparse is more robust than splitting on "/" (no IndexError on unusual URLs)
        domain = urlparse(link.url).netloc
        external_domains[domain] += 1

print(external_domains.most_common(10))

See Examples for more patterns.

Notebooks

Interactive examples in notebooks/:

crawl_cnn.ipynb — Crawls CNN.com, analyzes link structure, demonstrates all export formats

API Reference

See API Reference for complete method documentation.

Troubleshooting

"SSL: CERTIFICATE_VERIFY_FAILED"

Use ssl_verify=False for self-signed certs (testing only), or ssl_verify="/path/to/ca.pem" for corporate proxies.

"Too many connections"

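Reduce the number of requests in flight: lower max_retries so failed requests are re-sent less often, set request_delay to space out requests to the same domain, or raise request_timeout so slow responses are not cut off and retried. Default settings are conservative.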

"Crawler hits timeout on deep sites"

Try DFS traversal instead of BFS, or increase request_timeout.

See Troubleshooting for more.

Performance

Typical performance (single-domain crawl):

First run: ~50-500ms per page (network-bound)
Cached run: ~1-10ms per page (2-50x faster)
Memory: ~1MB per 100 pages

With persistent sessions + connection pooling, same-domain requests are 10-100x faster than per-request session setup.

Architecture

Spider (orchestrator)
  └─ Crawler (persistent session)
      ├─ aiohttp (HTTP requests + connection pooling)
      ├─ lxml (HTML parsing)
      ├─ ResponseCache (optional disk caching)
      └─ CookieJar (automatic cookie handling)

Spider manages the crawl queue and traversal. Crawler handles individual document fetching/parsing. All requests share one persistent aiohttp session per Spider instance.

Why WebCrawler?

vs Scrapy: Lightweight, focused, simpler API for link analysis. Scrapy is better for complex extraction pipelines.

vs requests + BeautifulSoup: Built-in async concurrency, automatic session reuse, retries, caching. Better for crawling multiple pages.

vs Selenium: Pure HTTP crawler (no JS execution). Faster, lighter, but can't handle dynamic sites.

Testing

just test          # Run all tests
just test-cov      # Run with coverage report

All 91 tests pass. 100% of core crawling paths tested (rate limiting, broken link tracking, robots.txt, callbacks).

Contributing

Bug reports and pull requests welcome on GitHub.

License

MIT

Project details

Release history

0.2.15: Jun 11, 2026
0.2.14: Jun 11, 2026
0.2.13: Jun 10, 2026
0.2.12: Jun 9, 2026
0.2.10: Jun 9, 2026
0.2.9: Jun 9, 2026
0.2.8: Jun 9, 2026
0.2.7: Jun 9, 2026
0.2.6: Jun 9, 2026
0.2.5: Jun 9, 2026
0.2.3: Jun 8, 2026
0.2.1: Jun 8, 2026
0.2.0: Jun 8, 2026
0.1.2: Jun 8, 2026
0.1.0 (this version, yanked): Jun 8, 2026

Download files

Source Distribution

linktrace-0.1.0.tar.gz (267.5 kB view details)

Uploaded Jun 8, 2026 Source

Built Distribution

linktrace-0.1.0-py3-none-any.whl (16.5 kB view details)

Uploaded Jun 8, 2026 Python 3

File details

Details for the file linktrace-0.1.0.tar.gz.

File metadata

Download URL: linktrace-0.1.0.tar.gz
Upload date: Jun 8, 2026
Size: 267.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for linktrace-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7ede2d93cf98e9fde7220fade966a50bf737fbaaec8f7bbaa45e4067f80c1979`
MD5	`0dde3f7a26e41bcee23cbd516bdfaca9`
BLAKE2b-256	`364dc1700806fe7c02d2ad09f20ec91b45940b4e24bae6978974c966514841bb`

Provenance

The following attestation bundles were made for linktrace-0.1.0.tar.gz:

Publisher: publish.yml on JayBaywatch/webcrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: linktrace-0.1.0.tar.gz
- Subject digest: 7ede2d93cf98e9fde7220fade966a50bf737fbaaec8f7bbaa45e4067f80c1979
- Sigstore transparency entry: 1758378982
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: JayBaywatch/webcrawler@3277e3ad0a3dce087b7b78626adae1000db96b50
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/JayBaywatch
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3277e3ad0a3dce087b7b78626adae1000db96b50
- Trigger Event: release

File details

Details for the file linktrace-0.1.0-py3-none-any.whl.

File metadata

Download URL: linktrace-0.1.0-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 16.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for linktrace-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b15767ec8a8190123e43098496c020294ec6b70056c93ec7923fe6eca585b9ed`
MD5	`0cea8db4966845e22eaafa01e50f1260`
BLAKE2b-256	`87ab4fb77ec20664733a5ea36e5bd7eeeb38e78e430e1998a99ec55f4e32f633`

Provenance

The following attestation bundles were made for linktrace-0.1.0-py3-none-any.whl:

Publisher: publish.yml on JayBaywatch/webcrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: linktrace-0.1.0-py3-none-any.whl
- Subject digest: b15767ec8a8190123e43098496c020294ec6b70056c93ec7923fe6eca585b9ed
- Sigstore transparency entry: 1758378992
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: JayBaywatch/webcrawler@3277e3ad0a3dce087b7b78626adae1000db96b50
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/JayBaywatch
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3277e3ad0a3dce087b7b78626adae1000db96b50
- Trigger Event: release

linktrace 0.1.0
