Extract metadata and URLs from PDF files
Introduction
linkrot scans PDFs for links written in plain text and checks whether each one is still active or returns an error code, then generates a report of its findings. It also extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.
Features
- Extract references and metadata from a given PDF.
- Detect PDF, URL, arXiv and DOI references.
- Check for a valid SSL certificate.
- Find broken hyperlinks (using the -c flag).
- Output as text or JSON (using the -j flag).
- Extract the PDF text (using the --text flag).
- Use as a command-line tool or Python package.
- Work with local and online PDFs.
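To illustrate what detecting PDF, URL, DOI and arXiv references in plain text can look like, here is a minimal sketch using only the Python standard library. The regular expressions and the find_references helper are assumptions made for this example; they are not linkrot's own patterns or API, and real-world text will need more careful handling.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not linkrot's own).
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+", re.IGNORECASE)
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)

TRAILING_PUNCTUATION = ".,;"


def _is_pdf(url):
    """Treat a URL as a PDF reference if its path ends in .pdf."""
    return url.lower().split("?")[0].endswith(".pdf")


def find_references(text):
    """Return a dict mapping reference type to a sorted list of unique matches."""
    urls = {m.rstrip(TRAILING_PUNCTUATION) for m in URL_RE.findall(text)}
    dois = {m.rstrip(TRAILING_PUNCTUATION) for m in DOI_RE.findall(text)}
    return {
        "pdf": sorted(u for u in urls if _is_pdf(u)),
        "url": sorted(u for u in urls if not _is_pdf(u)),
        "doi": sorted(dois),
        "arxiv": sorted(set(ARXIV_RE.findall(text))),
    }
```

Calling find_references on the text extracted from a PDF would give one list per reference type, which is roughly the shape of information the features above describe.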
Installation
Grab a copy of the code with pip or snap:
pip install linkrot
snap install linkrot
Usage
linkrot [pdf-file-or-url]
Run linkrot -h to see the help output:
linkrot -h
usage: linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
Extract metadata and references from a PDF, and optionally download all referenced PDFs.
Arguments
positional arguments:
pdf (Filename or URL of a PDF file)
optional arguments:
-h, --help (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)
-c, --check-links (Check for broken links)
-j, --json (Output infos as JSON (instead of plain text))
-v, --verbose (Print all references (instead of only PDFs))
-t, --text (Only extract text (no metadata or references))
-o OUTPUT_FILE, --output-file OUTPUT_FILE (Output to specified file instead of console)
--version (Show program's version number and exit)
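When calling the tool from a script, the flags above can be assembled programmatically. The following is a rough sketch that shells out to the linkrot command with the documented -c, -j, -v and -o flags. The helper names build_command and run_linkrot_json are hypothetical, and the sketch assumes that linkrot is on the PATH and that its standard output under -j contains only the JSON document; the structure of that JSON is not specified here.

```python
import json
import subprocess


def build_command(pdf, check_links=False, as_json=False, verbose=False, output_file=None):
    """Assemble a linkrot command line from the flags listed in the help output."""
    command = ["linkrot", pdf]
    if check_links:
        command.append("-c")  # check for broken links
    if as_json:
        command.append("-j")  # JSON instead of plain text
    if verbose:
        command.append("-v")  # print all references, not only PDFs
    if output_file:
        command += ["-o", output_file]  # write to a file instead of the console
    return command


def run_linkrot_json(pdf, check_links=False, verbose=False):
    """Run linkrot with -j and parse stdout as JSON.

    Assumption: stdout holds nothing but the JSON document.
    """
    command = build_command(pdf, check_links=check_links, as_json=True, verbose=verbose)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Keeping command construction separate from execution makes the flag handling easy to check without running the tool.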
Examples
Extract text to console
linkrot https://example.com/example.pdf -t
Extract text to file
linkrot https://example.com/example.pdf -t -o pdf-text.txt
Check links
linkrot https://example.com/example.pdf -c
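For readers curious what checking a single link involves, here is a minimal sketch using the standard library's urllib. It is not linkrot's implementation; the check_link function and its opener parameter are assumptions made for illustration. It sends a HEAD request and reports either the success status or the error the server or network returned.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return (is_active, detail) for a single URL.

    `opener` is injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return True, response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return False, exc.code
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, or an invalid SSL certificate.
        return False, str(getattr(exc, "reason", exc))
```

Some servers reject HEAD requests, so a more thorough checker might retry with GET before declaring a link broken.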
License
This program is licensed under the MIT License.
Built Distribution
Hashes for linkrot-2.1.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 41b78e155ea370869d2f80fca23f0abc055240c9f0dd34779af7b5e09d667472
MD5 | 3bb61d80a55c6d71c32ae9f072494092
BLAKE2b-256 | 164bc9c3ada89ad268ea994cbad58b74d591d6453b2f3d85a75c86e4569e6e29