Check for stale links inside PDFs
pdflinks
pdflinks is a tool that (1) extracts links from PDFs and (2) checks that they are not stale (timeouts, HTTP 4xx/5xx status codes, plain-HTTP URLs, etc.).
It is not meant for use in a CI pipeline. Most of the messages are false positives, which is inherent to the task. Instead, it is meant to be run from time to time, with the output checked by a human.
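To make the categories above concrete, here is a minimal sketch of how a single link could be classified using only the Python standard library. It is not the code pdflinks runs: the function name `check_url`, the `opener` parameter, and the `request failed` message are assumptions for illustration, and the other message strings simply mimic the sample output shown under Usage.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit

TIMEOUT_S = 5  # assumption: mirrors the "5s timeout" message in the sample output


def check_url(url, opener=urllib.request.urlopen):
    """Return a short problem description, or None if the link looks fine."""
    if urlsplit(url).scheme == "http":
        # Plain HTTP is reported without sending any request.
        return "skipped 'http' request"
    try:
        with opener(url, timeout=TIMEOUT_S) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return f"{exc.code} HTTP code"
    except (TimeoutError, socket.timeout):
        return f"{TIMEOUT_S}s timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return f"{TIMEOUT_S}s timeout"
        return f"request failed: {exc.reason}"
    return None if status < 400 else f"{status} HTTP code"
```

The `opener` parameter exists only so the function can be exercised without network access. A check like this cannot tell a dead page from a server that rejects automated clients, which is one reason a human has to review the output.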
Usage
⟩ uvx pdflinks full-audio-slides.pdf
full-audio-slides.pdf: 5s timeout: https://www.bluez.org/
full-audio-slides.pdf: skipped 'http' request: http://www.dest-unreach.org/socat/
full-audio-slides.pdf: 404 HTTP code: https://elixir.bootlin.com/linux/latest/source/sound/soc/samsung/neo1973_wm8753.c
full-audio-slides.pdf: 404 HTTP code: https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/sound/atmel-wm8904.txt
⟩ # only do link extraction:
⟩ uvx pdflinks -l full-audio-slides.pdf | wc -l
131
⟩ # warn on redirect:
⟩ uvx pdflinks --warn-on-redirects full-audio-slides.pdf | wc -l
91
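Since the output is meant to be read by a human, a small helper that tallies messages by kind can help with triage. The sketch below assumes every report line has the shape `<pdf>: <message>: <url>`, which is inferred from the sample output above and is not a documented format. The name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(lines):
    """Count reported problems per message kind."""
    counts = Counter()
    for line in lines:
        # Assumed line shape: "<pdf>: <message>: <url>"
        parts = line.rstrip("\n").split(": ", 2)
        if len(parts) != 3:
            continue  # ignore anything that does not match the assumed shape
        _pdf, message, _url = parts
        counts[message] += 1
    return counts
```

It could be fed the lines of a saved pdflinks run, for example by iterating over an open file.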
How it works
- Extract URLs from PDFs.
- Group them up by domain.
- Distribute domains to a few workers. A worker loops over URLs, making requests and reporting errors (but continuing).
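The three steps above can be sketched roughly as follows, assuming the URLs have already been extracted. This is an illustration of the scheme, not pdflinks' actual implementation: the names `group_by_domain` and `run`, the thread-based workers, and the default of 4 workers are all assumptions. The `check` argument is any callable that returns a problem string or `None`.

```python
import queue
import threading
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Return (domain, urls) pairs, deduplicated, largest group first."""
    groups = defaultdict(list)
    for url in dict.fromkeys(urls):  # dedup while keeping first-seen order
        groups[urlsplit(url).netloc].append(url)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


def run(urls, check, workers=4):
    """Check all URLs; return {url: problem} for the ones that failed."""
    pending = queue.Queue()
    for group in group_by_domain(urls):
        pending.put(group)
    problems = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                _domain, domain_urls = pending.get_nowait()
            except queue.Empty:
                return
            # One worker owns a domain, so its URLs are requested sequentially.
            for url in domain_urls:
                problem = check(url)
                if problem is not None:
                    with lock:
                        problems[url] = problem  # record the error, keep going

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return problems
```

Because a whole domain is handed to a single worker, no domain ever sees two requests in flight at once, and putting the largest groups in the queue first keeps one long domain from being started last.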
Notes:
- We start work with the domains that have the most URLs, since they will probably take the longest.
- It is more efficient to call it once on all PDFs rather than once per PDF. That way it groups URLs together, deduplicates them, and doesn't wait on a trailing domain with many URLs.
- The domain grouping means we never send more than one concurrent request to any domain. We don't sleep and don't respect robots.txt, however.
- We lie about our User-Agent to avoid being caught by Anubis & co.
- When an error occurs, we print it once per PDF in which it appeared. That means grepping the output for one specific PDF works as expected.
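The last note, on per-PDF reporting, can be illustrated with a short sketch. It assumes each URL has been checked once and the results stored in a `problems` mapping from URL to message. The name `report`, its arguments, and the line format (copied from the sample output) are illustrative assumptions, not pdflinks' API.

```python
def report(links_by_pdf, problems):
    """Return one "<pdf>: <message>: <url>" line per PDF containing a bad URL."""
    lines = []
    for pdf, urls in links_by_pdf.items():
        for url in dict.fromkeys(urls):  # a URL repeated in one PDF prints once
            if url in problems:
                lines.append(f"{pdf}: {problems[url]}: {url}")
    return lines
```

A URL shared by two PDFs is requested once but reported twice, once under each file name, so filtering the output by a single PDF name still shows every problem in that PDF.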
File details
Details for the file pdflinks-0.1.3.tar.gz.
File metadata
- Download URL: pdflinks-0.1.3.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.10 {"installer":{"name":"uv","version":"0.10.10","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | efc61617b712feb9aa92f275cef5b0ec8f54bbef30991e7704d266ad0dcf89f3 |
| MD5 | 2aac2b96ecdf48634677bf9d1f995ac3 |
| BLAKE2b-256 | 5cba39a5b1c163a3f8a5f109b27ec2635ac723115344a8241fc715f4c609fa9e |
File details
Details for the file pdflinks-0.1.3-py3-none-any.whl.
File metadata
- Download URL: pdflinks-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.10 {"installer":{"name":"uv","version":"0.10.10","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Arch Linux","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d694a855071b157527116367bf87680e3c7a939b4b788e20516e93317c2f982 |
| MD5 | 165b996e34b940f232ae4a0a97caf632 |
| BLAKE2b-256 | 5ad1f2d1e2994e0936dff3eeb82de4d687fb486389ebc2742ee6db8d1670f619 |