A tool to scrape a website and check for broken links.
Project description
Broken links Checker GitHub Action && command line tool
This tool scrapes all pages under a specified base URL and checks whether each destination link exists. It reports the page where the link was found, the anchor text, the destination URL, and whether the link works. If any link is broken, the tool exits with an error code. It also prints a summary of the analysis.
It can be run as a GitHub Action or as a command line tool.
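Conceptually, the check boils down to fetching a page, collecting its anchors, and probing each destination URL. The snippet below is a minimal sketch of that idea using the requests and beautifulsoup4 packages; it is illustrative only and does not reflect the package's actual implementation.

# Minimal sketch: list every link on a page and whether it responds.
# Assumes the requests and beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def check_page(page_url):
    """Return (anchor text, destination URL, ok) for each link on one page."""
    html = requests.get(page_url, timeout=10).text
    results = []
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        destination = urljoin(page_url, anchor["href"])  # resolve relative links
        try:
            status = requests.head(destination, allow_redirects=True, timeout=10).status_code
            ok = status < 400
        except requests.RequestException:
            ok = False
        results.append((anchor.get_text(strip=True), destination, ok))
    return results

for text, url, ok in check_page("http://example.com"):
    print(f"{'OK  ' if ok else 'FAIL'} {url} ({text})")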
GitHub Action
This tool can also be used as a GitHub Action to automatically check links in your repository.
Inputs
- url (optional): The base URL to start scraping from. Default is http://localhost:4444/.
- only-errors (optional): If set to true, only display errors. Default is false.
- ignore-file (optional): Path to the ignore file. Default is ./check-ignore. If the parameter is set and the file does not exist, the action exits with an error. See the Ignore File Format section below for more information.
Ignore File Format
The ignore file should contain one URL pattern per line. The patterns can include wildcards (*) to match multiple URLs. Here are some examples:
- http://example.com/ignore-this-page - Ignores this specific URL.
- http://example.com/ignore/* - Ignores all URLs that start with http://example.com/ignore/.
- */ignore-this-path/* - Ignores all URLs that contain /ignore-this-path/.
- https://*.domain.com* - Ignores all subdomains of domain.com, such as https://sub.domain.com or https://sub2.domain.com/page.
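For illustration, patterns like these can be matched against URLs with shell-style wildcard matching, for example via Python's fnmatch module. The sketch below shows the idea only; it is not the tool's actual matching code.

# Sketch: decide whether a URL matches one of the ignore patterns above.
from fnmatch import fnmatch

ignore_patterns = [
    "http://example.com/ignore-this-page",
    "http://example.com/ignore/*",
    "*/ignore-this-path/*",
    "https://*.domain.com*",
]

def is_ignored(url):
    return any(fnmatch(url, pattern) for pattern in ignore_patterns)

print(is_ignored("http://example.com/ignore/page"))    # True
print(is_ignored("https://sub.domain.com/page"))       # True
print(is_ignored("http://example.com/keep-this-page")) # False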
Outputs
This action does not produce any outputs. However, at the end of the analysis, it prints a summary of the results with:
- Number of pages analyzed
- Number of links analyzed
- Total number of links working
- Total number of links not working
- Number of external links working
- Number of external links not working
- Number of internal links working
- Number of internal links not working
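The internal/external split presumably comes down to whether a link's host matches the host of the base URL. Below is a minimal sketch of that classification, stated as an assumption for illustration rather than the action's actual logic.

# Sketch: classify a link as internal or external by comparing hostnames.
from urllib.parse import urlparse

def is_internal(link_url, base_url="http://localhost:4444/"):
    return urlparse(link_url).netloc == urlparse(base_url).netloc

print(is_internal("http://localhost:4444/docs/page.html"))  # True  (internal)
print(is_internal("https://example.com/page"))              # False (external)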
Examples of Usage
Basic Usage (external URL)
name: Broken-links Checker
on: [push]
jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Run Link Checker
        uses: merlos/broken-links@0.2.2
        with:
          url: 'http://example.com'
          only-errors: 'true'
Check links with MkDocs
name: MkDocs Preview and Link Check
on:
  push:
    branches:
      - main
jobs:
  preview_and_check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install mkdocs mkdocs-material
      - name: Run MkDocs server
        run: mkdocs serve -a 0.0.0.0:4444 &
        continue-on-error: true
      - name: Wait for server to start
        run: sleep 10
      - name: Run Link Checker
        uses: merlos/broken-links@0.2.2
        with:
          url: 'http://localhost:4444'
          only-errors: 'true'
          ignore-file: './check-ignore'
Check links with Quarto
name: Quarto Preview and Link Check
on:
  push:
    branches:
      - main
jobs:
  preview_and_check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
      - name: Render Quarto project
        run: quarto preview --port 444 &
        continue-on-error: true
      - name: Wait for server to start
        run: sleep 10
      - name: Run Link Checker
        uses: merlos/broken-links@0.2.2
        with:
          url: 'http://localhost:444'
          only-errors: 'true'
          ignore-file: './check-ignore'
Command-Line Utility
Installation
- Clone the repository:
  git clone https://github.com/merlos/broken-links.git
  cd broken-links
- Install the package:
  pip install .
- Use the broken-links command to run the script:
  broken-links http://example.com --only-error --ignore-file ./check-ignore
Command-line arguments:
- url (optional): The base URL to start scraping from. Default is http://localhost:4444/.
- --only-error or -o (optional): If set, only display errors. Default is false.
- --ignore-file or -i (optional): Path to the ignore file. Default is ./check-ignore. If the parameter is NOT set and the file does not exist, all links are checked. If the parameter is set and the file does not exist, the tool exits with an error.
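For reference, the documented arguments map onto a standard argparse interface roughly like the sketch below. The flag names and defaults follow the description above, but this is a hypothetical illustration; the package's real entry point may be wired differently.

# Hypothetical argparse sketch of the CLI described above.
import argparse

parser = argparse.ArgumentParser(
    prog="broken-links",
    description="Scrape a site and report broken links.")
parser.add_argument("url", nargs="?", default="http://localhost:4444/",
                    help="Base URL to start scraping from.")
parser.add_argument("-o", "--only-error", action="store_true",
                    help="Only display errors.")
parser.add_argument("-i", "--ignore-file", default=None,
                    help="Path to the ignore file (default: ./check-ignore).")

args = parser.parse_args(["http://example.com", "--only-error"])
print(args.url, args.only_error, args.ignore_file)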
Development
Clone the repository:
git clone https://github.com/merlos/broken-links
cd broken-links
Set up a virtual environment:
python -m venv venv
source venv/bin/activate
Install the package in editable mode (-e):
pip install -e .
Start coding!
Build the Docker image:
docker build -t broken-links .
docker run --rm broken-links http://example.com --only-error --ignore-file ./check-ignore
Tests
To run the tests, use the following command:
python -m unittest discover tests
Contributing
Fork and send a pull request. Please update/add the unit tests.
License
This project, by merlos, is licensed under the terms of the GNU General Public License v3.0.
Project details
Download files
Download the file for your platform:
Source Distribution: broken_links-0.2.0.tar.gz (17.7 kB)
Built Distribution: broken_links-0.2.0-py3-none-any.whl (19.3 kB)
File details
Details for the file broken_links-0.2.0.tar.gz.
File metadata
- Download URL: broken_links-0.2.0.tar.gz
- Upload date:
- Size: 17.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7eb6851732a2bc8873c37c6f144c0fa378a21da890ac8becf03f4324bf92b419
MD5 | 0d943c11a688ed6a635ca3a26d696697
BLAKE2b-256 | 91e0b125cf5d6ec617bfe5059a0fc92103de8b0bd30308112fad4a6b62fefd9c
File details
Details for the file broken_links-0.2.0-py3-none-any.whl.
File metadata
- Download URL: broken_links-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest
---|---
SHA256 | f17738a332d0795cbd908059648fee1f347bf9ebe83793ddcb97126a90c238a8
MD5 | 2aaee228804c4d6d45dfdf0b6b32e5d3
BLAKE2b-256 | 174792d0cbbface6130cc82b06bd2809932598200dc58f723ec9b1d5f7008263