A tool for checking broken links and cataloging internal assets on websites
Project description
Link Checker
A Python tool that checks websites for broken links and catalogs internal assets.
Features
- Crawls a website starting from a root URL while respecting URL hierarchy boundaries (won't crawl "up" from the starting URL)
- Detects broken internal links
- Catalogs references to non-HTML assets (images, text files, etc.)
- Only visits each page once
- Ignores external links
- Provides detailed logging
- Allows specifying paths to exclude from internal asset reporting
- Supports checking but not crawling specific website sections
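The hierarchy boundary and the "ignores external links" rule can be illustrated with a small sketch. This is not the tool's actual code: the function name `is_within_hierarchy` is hypothetical, and it assumes the root URL's path is treated as a directory prefix.

```python
from urllib.parse import urlsplit


def is_within_hierarchy(root_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the same host as root_url and
    at or below its path (hypothetical helper, assumed semantics)."""
    root = urlsplit(root_url)
    cand = urlsplit(candidate_url)

    # A different host means an external link, which is never crawled.
    if cand.netloc.lower() != root.netloc.lower():
        return False

    # Compare paths as directory prefixes so that "/section/sub" does not
    # accidentally match "/section/sub-old".
    root_path = root.path.rstrip("/") + "/"
    cand_path = cand.path.rstrip("/") + "/"
    return cand_path.startswith(root_path)
```

With a root of https://example.com/section/subsection, a page under that path is in bounds, while https://example.com/section/ (one level "up") is not.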
Installation
pip install rms-link-checker
Or from source:
git clone https://github.com/SETI/rms-link-checker.git
cd rms-link-checker
pip install -e .
Usage
link_checker https://example.com
Options
- --verbose or -v: Increase verbosity (can be used multiple times)
- --output or -o: Specify output file for results (default: stdout)
- --log-file: Write log messages to a file (in addition to console output)
- --log-level: Set the minimum level for messages in the log file (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- --timeout: Timeout in seconds for HTTP requests (default: 10.0)
- --max-requests: Maximum number of requests to make (default: unlimited)
- --max-depth: Maximum depth to crawl (default: unlimited)
- --ignore-asset-paths-file: Specify a file containing paths to ignore when reporting internal assets (one per line)
- --ignore-internal-paths-file: Specify a file containing paths to check once but not crawl (one per line)
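To show how these options fit together (in particular how a repeated -v accumulates), here is a sketch of an equivalent parser built with the standard argparse module. It mirrors the documented flags and defaults but is an assumption, not the package's real parser; `build_parser` is a hypothetical name, and using None for "unlimited", stdout, and an unset log level is a choice made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(prog="link_checker")
    parser.add_argument("root_url", help="URL to start crawling from")
    # action="count" turns -v into 1, -vv into 2, and so on.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    parser.add_argument("-o", "--output", default=None)  # None: stdout
    parser.add_argument("--log-file", default=None)
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        default=None,  # assumption: the real default is not documented here
    )
    parser.add_argument("--timeout", type=float, default=10.0)
    parser.add_argument("--max-requests", type=int, default=None)  # None: unlimited
    parser.add_argument("--max-depth", type=int, default=None)  # None: unlimited
    parser.add_argument("--ignore-asset-paths-file", default=None)
    parser.add_argument("--ignore-internal-paths-file", default=None)
    return parser


args = build_parser().parse_args(
    ["https://example.com", "-vv", "--max-depth=3", "--timeout=30.0"]
)
```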
Examples
Simple check:
link_checker https://example.com
Check a specific section of a website (won't crawl to parent directories):
link_checker https://example.com/section/subsection
Ignore specific asset paths:
# Create a file with paths to ignore
echo "/images" > ignore_assets.txt
echo "css" >> ignore_assets.txt # Leading slash is optional
echo "scripts" >> ignore_assets.txt
link_checker https://example.com --ignore-asset-paths-file ignore_assets.txt
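Because the leading slash is optional, entries such as "/images" and "css" presumably end up in the same normalized form. The sketch below shows one way that normalization and prefix matching could work; `load_ignore_paths` and `is_ignored` are hypothetical names, and matching on whole path segments is an assumption about the tool's behavior.

```python
def load_ignore_paths(lines):
    """Normalize entries so that 'css' and '/css/' both become '/css'."""
    paths = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        paths.append("/" + entry.strip("/"))
    return paths


def is_ignored(url_path, ignore_paths):
    """Return True if url_path equals an ignored path or lies beneath it."""
    normalized = "/" + url_path.strip("/")
    return any(
        normalized == prefix or normalized.startswith(prefix + "/")
        for prefix in ignore_paths
    )


# Same entries as the ignore_assets.txt file created above.
ignore_paths = load_ignore_paths(["/images\n", "css\n", "scripts\n"])
```

Under these assumptions, /images/logo.png and /css/site.css would be left out of the asset report, while /imagesx/a.png would still be listed.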
Check but don't crawl specific sections:
# Create a file with paths to check but not crawl
echo "docs" > ignore_crawl.txt # Leading slash is optional
echo "/blog" >> ignore_crawl.txt
link_checker https://example.com --ignore-internal-paths-file ignore_crawl.txt
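The difference between "check" and "crawl" can be sketched as a per-URL decision: a page under a listed section is still requested to confirm it resolves, but its own links are not followed. The names `Action` and `plan_action` are hypothetical, and the prefix-matching rule is an assumption rather than the tool's documented behavior.

```python
from enum import Enum


class Action(Enum):
    CRAWL = "crawl"            # fetch the page and follow its links
    CHECK_ONLY = "check_only"  # fetch the page, but do not follow its links


def plan_action(url_path, no_crawl_paths):
    """Decide how to treat an internal URL (hypothetical helper)."""
    normalized = "/" + url_path.strip("/")
    for entry in no_crawl_paths:
        prefix = "/" + entry.strip().strip("/")  # leading slash is optional
        if normalized == prefix or normalized.startswith(prefix + "/"):
            return Action.CHECK_ONLY
    return Action.CRAWL


# Same entries as the ignore_crawl.txt file created above.
no_crawl = ["docs", "/blog"]
```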
Verbose output with detailed logging:
link_checker https://example.com -vv
Verbose output with logs written to a file:
link_checker https://example.com -vv --log-file=link_checker.log
Verbose output with logs written to a file, but only warnings and errors:
link_checker https://example.com -vv --log-file=link_checker.log --log-level=WARNING
Limit crawl depth and set a longer timeout:
link_checker https://example.com --max-depth=3 --timeout=30.0
Limit the number of requests to avoid overwhelming the server:
link_checker https://example.com --max-requests=50
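How depth and request limits interact with "only visits each page once" can be shown with a breadth-first crawl over an in-memory link graph. This is an illustrative sketch, not the tool's implementation: `crawl` and `get_links` are hypothetical, the root page is assumed to be at depth 0, and each visited page is assumed to cost one request.

```python
from collections import deque


def crawl(start, get_links, max_depth=None, max_requests=None):
    """Breadth-first crawl that visits each page at most once."""
    visited = set()
    queue = deque([(start, 0)])
    order = []
    requests_made = 0

    while queue:
        # Stop entirely once the request budget is spent.
        if max_requests is not None and requests_made >= max_requests:
            break
        page, depth = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        requests_made += 1
        order.append(page)

        # Pages at the depth limit are checked but not expanded.
        if max_depth is not None and depth >= max_depth:
            continue
        for link in get_links(page):
            if link not in visited:
                queue.append((link, depth + 1))
    return order


# A tiny made-up site: each key links to the pages in its list.
site = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": ["/d"], "/d": []}


def get_links(page):
    return site.get(page, [])
```

Calling `crawl("/", get_links, max_depth=1)` on this graph stops after the root and its direct links, and `max_requests=2` stops after two pages regardless of depth.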
Report Format
The report includes:
- Configuration summary (root URL, hierarchy boundary, and ignored paths)
- Broken links found (grouped by page)
- Internal assets (grouped by type)
- Summary with counts (visited pages, broken links, assets)
- Stats on ignored assets, limited-crawl sections, and URLs outside hierarchy
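The two groupings in the report can be sketched as follows. This only illustrates the idea: the function names are hypothetical, and using the file extension as the asset "type" is an assumption (the tool may classify assets differently, for example by content type).

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def group_broken_links(broken):
    """Group (page, broken_link) pairs by the page they were found on."""
    grouped = defaultdict(list)
    for page, link in broken:
        grouped[page].append(link)
    return dict(grouped)


def group_assets_by_type(asset_urls):
    """Group asset URLs by lowercase file extension (assumed notion of type)."""
    grouped = defaultdict(list)
    for url in asset_urls:
        ext = posixpath.splitext(urlsplit(url).path)[1].lower()
        grouped[ext or "(no extension)"].append(url)
    return dict(grouped)


# Made-up sample data for illustration only.
broken = [("/index.html", "/missing.html"), ("/index.html", "/old.html")]
assets = [
    "https://example.com/img/a.PNG",
    "https://example.com/robots.txt",
    "https://example.com/img/b.png",
]
```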
Contributing
Information on contributing to this package can be found in the Contributing Guide.
Licensing
This code is licensed under the Apache License v2.0.
File details
Details for the file rms_link_checker-0.0.2.dev0.tar.gz.
File metadata
- Download URL: rms_link_checker-0.0.2.dev0.tar.gz
- Upload date:
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 30a60bcc25eccc93b9e10f2066850e4bb49120201c6f93c9b9a4dc3aabeb5abc |
| MD5 | 69a5ae37d283519b06b5a0b948904bfd |
| BLAKE2b-256 | 729e90f6b5eef447a8c60c43a54fe81965319db3bca5461e2744eca77022bbc5 |
File details
Details for the file rms_link_checker-0.0.2.dev0-py3-none-any.whl.
File metadata
- Download URL: rms_link_checker-0.0.2.dev0-py3-none-any.whl
- Upload date:
- Size: 18.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba5dc5610c003f318e08ec6dc306dd3cf7fc49c8004419751449c74e362ee6a1 |
| MD5 | fd559d4799c687766b7ea43ea7eed3d2 |
| BLAKE2b-256 | 2a27e8f5e41b00b0e84554befe09fad920a9a53f7ab4613ad0f8d4f4fc8d6ff5 |