Async web crawler with rate limiting, robots.txt support, and broken link tracking
linktrace
Lightweight async web crawler for link analysis and HTML document processing.
Perfect for: Site structure analysis, link tracking, concurrent page fetching, HTML document transformation.
Not: A replacement for Scrapy. Use this when you need simple, focused crawling with automatic link classification and clean document models.
Key Features
- Async/await native: Built on asyncio + aiohttp for concurrent requests
- Automatic link classification: Distinguishes internal vs external links by domain
- Rich document model: Full HTML source, parsed links, metadata, headers
- Persistent sessions: Connection pooling for 10-100x faster same-domain crawls
- Retries + backoff: Exponential backoff for transient errors (timeouts, 5xx)
- Rate limiting: Per-domain rate limiting with asyncio.Lock, no thundering herd
- robots.txt support: Automatically respects Crawl-delay directives and Disallow rules per domain
- Broken link tracking: Audit 404s and 5xx errors for site structure validation
- Optional caching: Disk-based cache (1-day TTL) for repeat crawls
- SSL verification: Secure by default, with corporate proxy support
- Automatic cookies: Set-Cookie extraction and sending built in
- Traversal strategies: BFS (broad) or DFS (deep) crawling
- Multi-format export: JSON, Pandas, Polars, PyArrow for data analysis
- Callbacks & streaming: Process results as they are crawled, without memory buildup
Quick Start
import asyncio
from linktrace import Spider

async def main():
    spider = Spider(start_url="https://example.com", max_depth=2)
    documents = await spider.run_async()
    for doc in documents:
        print(f"{doc.url}")
        print(f"  Internal links: {len(doc.internal_links)}")
        print(f"  External links: {len(doc.external_links)}")

asyncio.run(main())
Installation
pip install linktrace
Optional export formats:
pip install linktrace[serializers] # pandas + polars + pyarrow
pip install linktrace[pandas] # Just pandas
Core Concepts
Spider
High-level orchestrator that crawls multiple pages using BFS (breadth-first) or DFS (depth-first) traversal.
Crawler
Low-level engine that fetches and parses individual documents. Handles retries, caching, SSL, cookies, sessions.
Document
Rich object containing:
- url: page URL
- title: HTML title tag
- source: raw HTML
- internal_links: links to the same domain
- external_links: links to other domains
- status_code, response_headers, domain: metadata
See Core Concepts for more.
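As a rough illustration of how internal and external links can be told apart by domain, here is a minimal standard-library sketch. It is not linktrace's own implementation; the function name `classify_links` and the rule of comparing exact hostnames are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url: str, hrefs: list[str]) -> tuple[list[str], list[str]]:
    """Split hrefs into (internal, external) by comparing hostnames.

    Hypothetical helper: the real library may treat subdomains differently.
    """
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() == page_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = classify_links(
    "https://example.com/docs/",
    ["intro.html", "/about", "https://other.org/page", "mailto:hi@example.com"],
)
print(internal)
print(external)
```

Comparing exact hostnames means `blog.example.com` would count as external to `example.com`; whether that matches your needs depends on the site being analyzed.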
Configuration
Basic Crawl
spider = Spider(
    start_url="https://example.com",
    max_depth=3,  # How deep to follow links
    traversal_strategy="bfs",  # "bfs" (default) or "dfs"
)
documents = await spider.run_async()
Retries & Timeouts
spider = Spider(
    start_url="https://example.com",
    request_timeout=15,  # Seconds per request (default: 30)
    max_retries=5,  # Retry transient errors (default: 3)
)
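To make the retry behaviour concrete, the following is a minimal sketch of exponential backoff around an async fetch. It only illustrates the general technique: `TransientError`, `backoff_delays`, `fetch_with_retries`, the 0.5 s base delay, and the 30 s cap are all assumptions for this example, not linktrace's actual internals.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or 5xx response."""


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


async def fetch_with_retries(fetch, max_retries: int = 3, base: float = 0.5):
    """Await `fetch()`, retrying on TransientError with exponential backoff."""
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            # A little random jitter keeps clients from retrying in lockstep.
            await asyncio.sleep(delays[attempt] + random.uniform(0, base / 10))


async def demo():
    calls = 0

    async def flaky():
        nonlocal calls
        calls += 1
        if calls < 3:
            raise TransientError("simulated 503")
        return "ok"

    result = await fetch_with_retries(flaky, max_retries=5, base=0.001)
    return result, calls


result, calls = asyncio.run(demo())
print(result, calls)
```

With `max_retries=5`, a request is attempted at most six times in this sketch: the first try plus five retries.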
Caching
spider = Spider(
    start_url="https://example.com",
    cache_dir=".linktrace_cache",  # Enable disk caching (default: None/disabled)
)
# 2nd run will be 10-50x faster for same URLs
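For readers curious how a disk cache with a time-to-live can work, here is a small sketch under stated assumptions. The `DiskCache` class, its JSON file layout, and the SHA-256 file naming are invented for illustration; only the 1-day TTL comes from the feature list above.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


class DiskCache:
    """Hypothetical URL -> body cache; entries expire after `ttl_seconds`."""

    def __init__(self, cache_dir, ttl_seconds=86400):  # 86400 s = 1 day
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds

    def _path(self, url):
        # Hash the URL so any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.dir / f"{digest}.json"

    def set(self, url, body, now=None):
        now = time.time() if now is None else now
        entry = {"stored_at": now, "body": body}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, url, now=None):
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl:
            path.unlink()  # expired: drop the stale entry
            return None
        return entry["body"]


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp)
    cache.set("https://example.com", "<html></html>", now=1000.0)
    fresh = cache.get("https://example.com", now=1000.0 + 3600)   # 1 hour later
    stale = cache.get("https://example.com", now=1000.0 + 90000)  # over a day later

print(fresh, stale)
```

The `now` parameter exists only so the expiry logic can be exercised without waiting a day.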
SSL & Corporate Proxies
# Default: verify SSL with system CA
spider = Spider(start_url="https://example.com")

# Corporate proxy with custom CA bundle
spider = Spider(
    start_url="https://example.com",
    ssl_verify="/path/to/corporate-ca.pem",
)

# Self-signed certs (testing only)
spider = Spider(
    start_url="https://example.com",
    ssl_verify=False,  # Insecure
)
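One plausible way to turn the three `ssl_verify` forms shown above into something an HTTP client can use is sketched below with the standard `ssl` module. The helper name `build_ssl_argument` is hypothetical, and this is an assumption about the mapping, not a description of what linktrace does internally.

```python
import ssl


def build_ssl_argument(ssl_verify=True):
    """Map an ssl_verify setting to an SSLContext, or False to disable checks.

    Hypothetical helper for illustration only.
    """
    if ssl_verify is True:
        # System CA store, certificate and hostname verification enabled.
        return ssl.create_default_context()
    if ssl_verify is False:
        # Verification disabled: only appropriate for local testing.
        return False
    # Otherwise treat the value as a path to a CA bundle (PEM file).
    return ssl.create_default_context(cafile=str(ssl_verify))


default_ctx = build_ssl_argument(True)
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)
print(build_ssl_argument(False))
```

Passing a custom CA bundle keeps verification on while trusting the proxy's certificate, which is why it is preferable to `ssl_verify=False` behind a corporate proxy.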
Cookies are handled automatically; no configuration is needed.
Callbacks: Process Results in Real-Time
For large crawls, avoid memory buildup by processing documents as they're crawled:
import json

# Stream results to disk as JSON Lines
async def save_result(doc):
    with open("results.jsonl", "a") as f:
        f.write(json.dumps({"url": doc.url, "title": doc.title}) + "\n")

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=save_result,
    accumulate_results=False,  # Don't keep in memory
)
await spider.run_async()  # Returns [], file has results
Callback Hooks:
- on_page_crawled(doc): Called after each successful crawl. The return value is accumulated if accumulate_results=True
- on_error(url, exc): Called on crawl failures
- on_crawl_complete(): Called when the crawl finishes (cleanup hook)
Async Callbacks Supported:
async def save_to_db(doc):
    await db.insert(doc.url, doc.title)
    return doc.url

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=save_to_db,  # Async callback
    accumulate_results=True,
)
results = await spider.run_async()  # Returns list of URLs
Return Logic:
- No callback: returns all documents (default)
- Callback + accumulate_results=False: returns [] (streaming mode)
- Callback + accumulate_results=True: returns callback results
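The three return rules above can be sketched as a small dispatch loop that accepts both sync and async callbacks. The `dispatch` function is a hypothetical stand-in written for this example and says nothing about how linktrace is structured internally.

```python
import asyncio
import inspect


async def dispatch(documents, on_page_crawled=None, accumulate_results=True):
    """Apply the return rules to an already-crawled list of documents."""
    results = []
    for doc in documents:
        if on_page_crawled is None:
            results.append(doc)  # no callback: keep the document itself
            continue
        value = on_page_crawled(doc)
        if inspect.isawaitable(value):
            value = await value  # async callback: wait for its result
        if accumulate_results:
            results.append(value)  # keep what the callback returned
    return results


async def demo():
    docs = ["page-a", "page-b"]

    async def shout(doc):
        return doc.upper()

    no_callback = await dispatch(docs)
    streaming = await dispatch(docs, on_page_crawled=shout, accumulate_results=False)
    collected = await dispatch(docs, on_page_crawled=len, accumulate_results=True)
    return no_callback, streaming, collected


no_callback, streaming, collected = asyncio.run(demo())
print(no_callback, streaming, collected)
```

Checking `inspect.isawaitable` on the return value, rather than inspecting the callback itself, lets plain functions and coroutines share one code path.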
Traversal Strategies
BFS (Breadth-First): Default
# Explores level by level: all depth-1 links, then depth-2, etc.
spider = Spider(start_url="https://example.com", max_depth=3, traversal_strategy="bfs")
DFS (Depth-First)
# Follows single paths all the way down before exploring siblings
spider = Spider(start_url="https://example.com", max_depth=5, traversal_strategy="dfs")
Use DFS for deep hierarchies (documentation sites, nested directories). Use BFS for broad exploration.
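The difference between the two strategies comes down to which end of the frontier the next URL is taken from. The sketch below walks a small in-memory link graph instead of fetching pages; `crawl_order` and the sample graph are made up for illustration.

```python
from collections import deque


def crawl_order(graph, start, max_depth, strategy="bfs"):
    """Return the visit order over `graph` (url -> list of linked urls)."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest entry (queue), DFS the newest (stack).
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond max_depth
        links = graph.get(url, [])
        # Reverse for DFS so the first link on a page is explored first.
        for link in (links if strategy == "bfs" else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}
bfs = crawl_order(site, "/", max_depth=2, strategy="bfs")
dfs = crawl_order(site, "/", max_depth=2, strategy="dfs")
print(bfs)  # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(dfs)  # ['/', '/a', '/a/1', '/a/2', '/b', '/b/1']
```

Both strategies visit the same pages here; only the order differs, which matters most when a crawl is stopped early or results are streamed.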
Rate Limiting & robots.txt
By default, linktrace automatically respects robots.txt Crawl-delay directives and Disallow rules, enforcing per-domain rate limiting:
# Automatic robots.txt respect (default)
spider = Spider(
    start_url="https://example.com",
    user_agent="MyBot/1.0",  # Identifies your bot to robots.txt rules
)
await spider.run_async()
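To see what Disallow and Crawl-delay rules mean for a given user agent, Python's standard `urllib.robotparser` can evaluate a robots.txt file directly. The sketch below parses an inline sample file rather than fetching one; the rules shown are invented, and linktrace may use a different parser internally.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check two URLs against the rules that apply to our user agent.
allowed = parser.can_fetch("MyBot/1.0", "https://example.com/index.html")
blocked = parser.can_fetch("MyBot/1.0", "https://example.com/private/data")
delay = parser.crawl_delay("MyBot/1.0")  # seconds between requests, or None

print(allowed, blocked, delay)
```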
Customize rate limiting:
# Enforce explicit delay (ignores robots.txt)
spider = Spider(
    start_url="https://example.com",
    request_delay=1.0,  # 1 second between requests to same domain
    respect_robots_txt=False,  # Don't fetch robots.txt
)
# Concurrent requests to different domains, serialized to same domain
await spider.run_async()
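The idea of running different domains concurrently while serializing requests to the same domain can be sketched with one `asyncio.Lock` per domain. `DomainRateLimiter` is a hypothetical class written for this example; it shows the general technique rather than linktrace's actual code.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay: float):
        self.delay = delay
        self._locks = defaultdict(asyncio.Lock)  # one lock per domain, created lazily
        self._last_request = {}  # domain -> monotonic time of last request

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        # Holding the lock while sleeping queues same-domain callers one by one.
        async with self._locks[domain]:
            last = self._last_request.get(domain)
            if last is not None:
                remaining = self.delay - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_request[domain] = time.monotonic()


async def demo():
    limiter = DomainRateLimiter(delay=0.05)
    stamps = defaultdict(list)

    async def visit(url):
        await limiter.wait(url)
        stamps[urlparse(url).netloc].append(time.monotonic())

    start = time.monotonic()
    await asyncio.gather(
        visit("https://a.example/1"),
        visit("https://a.example/2"),
        visit("https://a.example/3"),
        visit("https://b.example/1"),
    )
    return stamps, time.monotonic() - start


stamps, elapsed = asyncio.run(demo())
print({domain: len(times) for domain, times in stamps.items()}, round(elapsed, 2))
```

Because each waiter sleeps while holding its domain's lock, requests to one domain are released one at a time instead of all waking together, while other domains proceed independently.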
Track Crawl Status
Monitor which pages returned error status codes:
spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()

# Find pages with error responses
error_pages = [doc for doc in documents if doc.status_code >= 400]
for doc in error_pages:
    print(f"Error: {doc.url} (HTTP {doc.status_code})")

# Monitor disallowed pages (403 from robots.txt)
disallowed = [doc for doc in documents if doc.status_code == 403]
print(f"Disallowed by robots.txt: {len(disallowed)} pages")
Stream crawl status in real-time:
async def track_errors(doc):
    if doc.status_code >= 400:
        print(f"Error: {doc.url} (HTTP {doc.status_code})")

spider = Spider(
    start_url="https://example.com",
    on_page_crawled=track_errors,
    accumulate_results=False,
)
await spider.run_async()
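For a broken-link audit it often helps to group error pages by status class rather than print them one by one. The sketch below does this with a stand-in `Page` dataclass that exposes only `url` and `status_code`; both `Page` and `audit` are hypothetical names for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a crawled document."""
    url: str
    status_code: int


def audit(pages):
    """Group URLs of failing pages into client (4xx) and server (5xx) errors."""
    report = defaultdict(list)
    for page in pages:
        if 400 <= page.status_code < 500:
            report["client_error"].append(page.url)
        elif page.status_code >= 500:
            report["server_error"].append(page.url)
    return dict(report)


pages = [
    Page("https://example.com/", 200),
    Page("https://example.com/missing", 404),
    Page("https://example.com/api", 503),
    Page("https://example.com/private", 403),
]
report = audit(pages)
print(report)
```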
Export Data
from linktrace import Spider, Serializers
spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()
# Export to JSON
serializer = Serializers(documents)
serializer.to_json("crawl.json", include_html=False)
# Export to Pandas (one row per link)
df = serializer.to_pandas()
print(df[["url", "title", "link_url", "link_type"]])
# Export to Polars (faster for large datasets)
df_polars = serializer.to_polars()
# Export to PyArrow (for data pipelines)
table = serializer.to_arrow()
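The "one row per link" layout mentioned above can be pictured as flattening each document's link lists into records. The sketch below builds such rows as plain dictionaries; the `Doc` dataclass, the `to_rows` helper, and the `"internal"`/`"external"` values for `link_type` are assumptions for illustration, not the library's documented output.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a crawled document."""
    url: str
    title: str
    internal_links: list = field(default_factory=list)
    external_links: list = field(default_factory=list)


def to_rows(docs):
    """Flatten documents into one record per (page, link) pair."""
    rows = []
    for doc in docs:
        groups = (("internal", doc.internal_links), ("external", doc.external_links))
        for link_type, links in groups:
            for link_url in links:
                rows.append({
                    "url": doc.url,
                    "title": doc.title,
                    "link_url": link_url,
                    "link_type": link_type,
                })
    return rows


docs = [
    Doc(
        url="https://example.com/",
        title="Home",
        internal_links=["https://example.com/about"],
        external_links=["https://other.org/"],
    ),
]
rows = to_rows(docs)
for row in rows:
    print(row)
```

A list of dictionaries like this can be handed to most dataframe libraries as-is, which is one reason a long, one-row-per-link shape is convenient for analysis.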
Link Analysis
from collections import Counter
from urllib.parse import urlparse

spider = Spider(start_url="https://example.com", max_depth=2)
documents = await spider.run_async()

# Count external domains
external_domains = Counter()
for doc in documents:
    for link in doc.external_links:
        # urlparse is more robust than splitting the URL on "/"
        domain = urlparse(link.url).netloc
        external_domains[domain] += 1
print(external_domains.most_common(10))
See Examples for more patterns.
Notebooks
Interactive examples in notebooks/:
- crawl_cnn.ipynb: Crawls CNN.com, analyzes link structure, demonstrates all export formats
API Reference
See API Reference for complete method documentation.
Troubleshooting
"SSL: CERTIFICATE_VERIFY_FAILED"
Use ssl_verify=False for self-signed certs (testing only), or ssl_verify="/path/to/ca.pem" for corporate proxies.
"Too many connections"
Lower max_retries so that fewer retry requests are in flight, or increase request_timeout. Default settings are conservative.
"Crawler hits timeout on deep sites"
Try DFS traversal instead of BFS, or increase request_timeout.
See Troubleshooting for more.
Performance
Typical performance (single-domain crawl):
- First run: ~50-500ms per page (network-bound)
- Cached run: ~1-10ms per page (2-50x faster)
- Memory: ~1MB per 100 pages
With persistent sessions + connection pooling, same-domain requests are 10-100x faster than per-request session setup.
Architecture
Spider (orchestrator)
└─ Crawler (persistent session)
   ├─ aiohttp (HTTP requests + connection pooling)
   ├─ lxml (HTML parsing)
   ├─ ResponseCache (optional disk caching)
   └─ CookieJar (automatic cookie handling)
Spider manages the crawl queue and traversal. Crawler handles individual document fetching/parsing. All requests share one persistent aiohttp session per Spider instance, so connection pooling, cookies, SSL configuration, and DNS caching are reused across the crawl.
Why linktrace?
Scrapy is an excellent full crawling and extraction framework. linktrace is designed for a narrower job: fast async link analysis with minimal setup.
Instead of building a Scrapy project around spiders, requests, responses, callbacks, items, pipelines, middleware, and settings, linktrace gives you a direct document-centric API. Each crawled URL becomes a Document object containing the page source, title, status code, response headers, domain, internal links, external links, and crawl status metadata.
That makes linktrace useful when your goal is to inspect site structure, trace links, audit crawl status, or export crawl results to dataframe-oriented tools without creating a larger scraping project.
linktrace also reuses a persistent aiohttp session during a crawl. Connection pooling, cookie reuse, SSL configuration, request timeouts, per-host limits, and DNS caching are carried across requests, which can make repeated same-domain crawls much faster than creating a fresh client/session per URL.
Use Scrapy when: you need a mature scraping framework with item pipelines, middleware, schedulers, broad ecosystem support, and complex extraction workflows.
Use linktrace when: you want a focused async crawler that turns URLs into analyzable Document objects with automatic link classification and simple exports.
vs requests + BeautifulSoup: Built-in async concurrency, automatic session reuse, retries, caching, rate limiting, and structured document objects. Better for crawling multiple pages.
vs Selenium: Pure HTTP crawler (no JS execution). Faster, lighter, but can't handle dynamic sites.
Testing
just test # Run all tests
just test-cov # Run with coverage report
All 91 tests pass. 100% of core crawling paths tested (rate limiting, broken link tracking, robots.txt, callbacks).
Contributing
Bug reports and pull requests welcome on GitHub.
License
MIT