A purpose-built PDF link analysis and reporting tool with GUI and CLI.

pdflinkcheck

A purpose-built tool for comprehensive analysis of hyperlinks and link remnants within PDF documents, primarily using the PyMuPDF library. Use the CLI or the GUI.


Screenshot of the pdflinkcheck GUI


📥 Access and Installation

The recommended ways to use pdflinkcheck are to install the CLI with pipx or to download the latest binary for your system from Releases.

🚀 Recommended Access (Binary Files)

For the most typical end-user experience, download the single-file binary that matches your operating system.

File Type | Primary Use Case | Recommended Launch Method
Executable (.exe, .elf, .pyz) | GUI (Double-Click) | Double-click the file (use the accompanying .bat file on Windows).
PYZ (Python Zip App) | CLI (Terminal) | Run using your system's python command: python pdflinkcheck-VERSION.pyz analyze ...

Installation via pipx

For an isolated environment where you can access pdflinkcheck from any terminal:

# Ensure you have pipx installed first (if not, run: pip install pipx)
pipx install pdflinkcheck
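
To confirm from a script that the install put the command on your PATH, a minimal standard-library sketch could look like the following. The helper name `is_on_path` is hypothetical and is not part of pdflinkcheck.

```python
import shutil


def is_on_path(command: str = "pdflinkcheck") -> bool:
    """Return True if `command` resolves to an executable on the current PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if is_on_path():
        print("pdflinkcheck is available on PATH")
    else:
        # `pipx ensurepath` adds pipx's binary directory to PATH for new shells.
        print("pdflinkcheck not found; try `pipx ensurepath` and open a new terminal")
```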

💻 Graphical User Interface (GUI)

The tool can be run as a simple cross-platform graphical interface (Tkinter).

Launching the GUI

There are three ways to launch the GUI:

  1. Implicit Launch: Run the main command with no arguments, subcommands, or flags (pdflinkcheck).
  2. Explicit Command: Use the dedicated GUI subcommand (pdflinkcheck gui).
  3. Binary Double-Click:
    • Windows: Double-click the pdflinkcheck-VERSION-gui.bat file.
    • macOS/Linux: Double-click the downloaded .pyz or .elf file.

Planned GUI Updates

We are actively working on the following enhancements:

  • Report Export: Functionality to export the full analysis report to a plain text file.
  • License Visibility: A dedicated "License Info" button within the GUI to display the terms of the AGPLv3+ license.

🚀 CLI Usage

The core functionality is accessed via the analyze command. All commands include the built-in --help flag for quick reference.

Available Commands

Command | Description
pdflinkcheck analyze | Analyzes a PDF file for links and remnants.
pdflinkcheck gui | Explicitly launches the Graphical User Interface.
pdflinkcheck license | Displays the full AGPLv3+ license text in the terminal.

analyze Command Options

Option | Description | Default
<PDF_PATH> | Required. The path to the PDF file to analyze. | N/A
--check-remnants / --no-check-remnants | Toggle scanning the text layer for unlinked URLs/Emails. | --check-remnants
--max-links INTEGER | Maximum number of links/remnants to display in the detailed report sections. Use 0 to show all. | 0 (Show All)
--export-format FORMAT | Format for the exported report. If specified, the report is saved to a file named after the PDF. Currently supported: JSON. | JSON
--help | Show command help and exit. | N/A
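
The "0 means show all" convention for --max-links can be sketched in a few lines of Python. This illustrates the documented behavior and is not the tool's own code; the function name `limit_for_display` is hypothetical, and rejecting negative values is an assumption, since the table above does not say how they are handled.

```python
def limit_for_display(items, max_links=0):
    """Return the first `max_links` items, or all of them when max_links is 0."""
    if max_links < 0:
        # ASSUMPTION: negative limits are treated as an error in this sketch.
        raise ValueError("max_links must be 0 or a positive integer")
    items = list(items)
    return items if max_links == 0 else items[:max_links]
```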

gui Command Options

Option | Description | Default
--auto-close INTEGER | (For testing/automation only). Delay in milliseconds after which the GUI window will automatically close. | 0 (Disabled)

Example Runs

# Analyze a document, show all links/remnants, and save the report as JSON
pdflinkcheck analyze "TE Maxson WWTF O&M Manual.pdf" --export-format JSON

# Analyze a document but skip the time-consuming remnant check
pdflinkcheck analyze "another_doc.pdf" --no-check-remnants 

# Analyze a document but keep the print block short, showing only the first 10 links for each type
pdflinkcheck analyze "TE Maxson WWTF O&M Manual.pdf" --max-links 10

# Show the GUI for only a moment, like in a build check
pdflinkcheck gui --auto-close 3000 
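
The same runs can be driven from a Python script, for example in a batch job. The sketch below uses only the options documented above; the helper names `build_analyze_command` and `run_analyze` are hypothetical, and it assumes pdflinkcheck is already on your PATH.

```python
import subprocess
from typing import List, Optional


def build_analyze_command(
    pdf_path: str,
    check_remnants: bool = True,
    max_links: int = 0,
    export_format: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `pdflinkcheck analyze` from documented options."""
    cmd = ["pdflinkcheck", "analyze", pdf_path]
    if not check_remnants:
        cmd.append("--no-check-remnants")
    if max_links:
        cmd += ["--max-links", str(max_links)]
    if export_format:
        cmd += ["--export-format", export_format]
    return cmd


def run_analyze(pdf_path: str, **options) -> int:
    """Run the CLI and return its exit code (requires pdflinkcheck on PATH)."""
    completed = subprocess.run(build_analyze_command(pdf_path, **options), check=False)
    return completed.returncode
```

Passing the arguments as a list rather than a single shell string means file names containing spaces or characters such as "&" need no extra quoting.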

📦 Library Access (Advanced)

For developers importing pdflinkcheck into other Python projects, the core analysis functions are exposed directly in the root namespace:

Function | Description
run_analysis() | (Primary function) Performs the full analysis, prints to console, and handles file export.
extract_links() | Low-level function to retrieve all explicit links (URIs, GoTo, etc.) from a PDF path.
extract_toc() | Low-level function to extract the PDF's internal Table of Contents (bookmarks/outline).

from pdflinkcheck.analyze import run_analysis, extract_links, extract_toc
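
A defensive way to call the library from another project might look like the sketch below. It assumes that extract_links() accepts a PDF path, as the table above describes, and that it returns a sized collection with one entry per link; the return type is not documented here, so treat that part as an assumption and check the function's docstring. The wrapper name `count_explicit_links` is hypothetical.

```python
from pathlib import Path


def count_explicit_links(pdf_path):
    """Return the number of explicit links in a PDF, or None if it cannot be checked."""
    path = Path(pdf_path)
    if not path.is_file():
        return None  # nothing to analyze
    try:
        # Import path as shown in this README.
        from pdflinkcheck.analyze import extract_links
    except ImportError:
        return None  # pdflinkcheck is not installed in this environment
    # ASSUMPTION: extract_links returns a sized collection, one entry per link.
    links = extract_links(str(path))
    return len(links)
```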

✨ Features

  • Active Link Extraction: Identifies and categorizes all programmed links (External URIs, Internal GoTo/Destinations, Remote Jumps).
  • Anchor Text Retrieval: Extracts the visible text corresponding to each link's bounding box.
  • Remnant Detection: Scans the document's text layer for unlinked URIs and email addresses that should potentially be converted into active links.
  • Structural TOC: Extracts the PDF's internal Table of Contents (bookmarks/outline).
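
To make the idea of remnant detection concrete, the sketch below finds unlinked URLs and email addresses in a string of extracted text. The regular expressions are simplified illustrations and are an assumption, not the patterns pdflinkcheck actually uses; `find_remnants` is a hypothetical name.

```python
import re

# ASSUMPTION: simplified illustrative patterns, not the ones pdflinkcheck uses.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_remnants(text):
    """Return (urls, emails) found in plain text, trimming trailing punctuation from URLs."""
    urls = [m.group(0).rstrip(".,;:") for m in URL_PATTERN.finditer(text)]
    emails = [m.group(0) for m in EMAIL_PATTERN.finditer(text)]
    return urls, emails


if __name__ == "__main__":
    sample = "See https://example.com/docs, or write to ops@example.org."
    print(find_remnants(sample))
```

A real remnant scan would also need to exclude text that already sits under an active link, which this sketch does not attempt.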

📜 License Implications (AGPLv3+)

pdflinkcheck is licensed under the GNU Affero General Public License version 3 or later (AGPLv3+).

This license has significant implications for distribution and network use, particularly for organizations:

  • Source Code Provision: If you distribute this tool (modified or unmodified) to anyone, you must provide the full source code under the same license.
  • Network Interaction (Affero Clause): If you modify this tool and make the modified version available to users over a computer network (e.g., as a web service or backend), you must also offer the source code to those network users.

Before deploying or modifying this tool for organizational use, especially for internal web services or distribution, please ensure compliance with the AGPLv3+ terms.


⚠️ Compatibility Notes

  • Platform Compatibility: This tool relies on the PyMuPDF library. All attempts to run it in a Termux (Android) environment have failed because of underlying C/C++ library compilation issues with PyMuPDF. It is recommended for use on standard Linux, macOS, or Windows operating systems.
  • Document Compatibility: While pdflinkcheck uses the robust PyMuPDF library, not all PDF files can be processed successfully. This tool is designed primarily for digitally generated (vector-based) PDFs. Processing may fail or yield incomplete results for:
    • Scanned PDFs (images of text) that lack an accessible text layer.
    • Encrypted or Password-Protected documents.
    • Malformed or non-standard PDF files.
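
Two of the cases above can be spotted cheaply before running an analysis. The sketch below is a rough standard-library heuristic that inspects the raw bytes of a file; it is not part of pdflinkcheck, `preflight_pdf` is a hypothetical name, and it cannot detect scanned PDFs that lack a text layer.

```python
from pathlib import Path


def preflight_pdf(path):
    """Return a list of warnings from a cheap, heuristic look at the raw bytes."""
    data = Path(path).read_bytes()
    warnings = []
    # A well-formed PDF normally has a "%PDF-" header near the top of the file.
    if b"%PDF-" not in data[:1024]:
        warnings.append("missing %PDF- header: file may be malformed or not a PDF")
    # Encrypted PDFs reference an /Encrypt dictionary in the trailer.
    # HEURISTIC: a raw byte search can produce false positives.
    if b"/Encrypt" in data:
        warnings.append("/Encrypt found: file may be encrypted or password-protected")
    return warnings
```

An empty list means neither check fired; it does not guarantee that the file will process successfully.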

Run from Source (Developers)

git clone http://github.com/city-of-memphis-wastewater/pdflinkcheck.git
cd pdflinkcheck
uv sync
uv run python src/pdflinkcheck/cli.py --help
