Extract metadata and URLs from PDF files
Introduction
linkrot scans PDFs for links written in plaintext and checks whether each one is still active, reporting an error code when it is not. It then generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.
Check out our sister project, Rotting Research, for a web app implementation of this project.
Features
- Extracts references and metadata from a given PDF.
- Detects PDF, URL, arXiv, and DOI references.
- Archives valid links using the Internet Archive's Wayback Machine (using the -a flag).
- Checks for valid SSL certificates.
- Finds broken hyperlinks (using the -c flag).
- Outputs as text or JSON (using the -j flag).
- Extracts the PDF text (using the --text flag).
- Can be used as a command-line tool or Python package.
- Works with local and online PDFs.
Installation
Grab a copy of the code with pip:
pip install linkrot
Usage
linkrot can be used to extract info from a PDF in two ways:
- Command line/Terminal tool
linkrot
- Python library
import linkrot
1. Command Line/Terminal tool
linkrot [pdf-file-or-url]
Run linkrot -h to see the help output:
linkrot -h
usage:
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-a] [-o OUTPUT_FILE] [--version] pdf
Extract metadata and references from a PDF, and optionally download all referenced PDFs.
Arguments
positional arguments:
pdf (Filename or URL of a PDF file)
optional arguments:
-h, --help (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)
-c, --check-links (Check for broken links)
-j, --json (Output info as JSON (instead of plain text))
-v, --verbose (Print all references (instead of only PDFs))
-t, --text (Only extract text (no metadata or references))
-a, --archive (Archive active links)
-o OUTPUT_FILE, --output-file OUTPUT_FILE (Output to specified file instead of console)
--version (Show program's version number and exit)
PDF Samples
For testing purposes, you can find PDF samples in a [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).
Examples
Extract text to console.
linkrot https://example.com/example.pdf -t
Extract text to file
linkrot https://example.com/example.pdf -t -o pdf-text.txt
Check Links
linkrot https://example.com/example.pdf -c
2. Main Python Library
Import the library:
import linkrot
Create an instance of the linkrot class like so:
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
Now the following functions can be used to extract specific data from the PDF:
get_metadata()
Arguments: None
Usage:
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'>
Information Provided: All metadata, including hidden metadata, associated with the PDF, such as creation date, creator, title, etc.
get_text()
Arguments: None
Usage:
text = pdf.get_text() #pdf is the instance of the linkrot class
Return type: String <class 'str'>
Information Provided: The entire content of the PDF in string form.
get_references(reftype=None, sort=False)
Arguments:
reftype: The type of reference that is needed
values: 'pdf', 'url', 'doi', 'arxiv'.
default: Provides all reference types.
sort: Whether the references should be sorted
values: True or False.
default: False (not sorted).
Usage:
references_list = pdf.get_references() #pdf is the instance of the linkrot class
Return type: Set <class 'set'>
of <linkrot.backends.Reference object>
linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced
Information Provided: All references with their corresponding type and page number.
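Since each Reference object exposes ref, reftype, and page, the returned set is easy to post-process. A minimal sketch, using a named tuple as a hypothetical stand-in for linkrot.backends.Reference (illustration only, not the library's actual class):

```python
from collections import namedtuple

# Hypothetical stand-in carrying the three documented Reference fields.
Reference = namedtuple("Reference", ["ref", "reftype", "page"])

refs = {
    Reference("https://example.com/a.pdf", "pdf", 2),
    Reference("https://example.com", "url", 1),
    Reference("10.1000/xyz123", "doi", 3),
}

# Keep only URL references, ordered by the page they appear on.
urls_by_page = sorted(
    (r for r in refs if r.reftype == "url"), key=lambda r: r.page
)
print([r.ref for r in urls_by_page])  # ['https://example.com']
```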
get_references_as_dict(reftype=None, sort=False)
Arguments:
reftype: The type of reference that is needed
values: 'pdf', 'url', 'doi', 'arxiv'.
default: Provides all reference types.
sort: Whether the references should be sorted
values: True or False.
default: False (not sorted).
Usage:
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'>
with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list <class 'list'>
of refs of that type.
Information Provided: All references in their corresponding type list.
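The returned dictionary shape can be illustrated with plain (ref, reftype) pairs standing in for real references (illustration only, not the library's internals):

```python
# Hypothetical references as (ref, reftype) pairs, for illustration only.
refs = [
    ("https://example.com", "url"),
    ("10.1000/xyz123", "doi"),
    ("https://example.com/a.pdf", "pdf"),
]

# Mirror the dictionary shape returned by get_references_as_dict():
# one list of refs per reference type.
grouped = {"pdf": [], "url": [], "doi": [], "arxiv": []}
for ref, reftype in refs:
    grouped[reftype].append(ref)

print(grouped["doi"])  # ['10.1000/xyz123']
```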
download_pdfs(target_dir)
Arguments:
target_dir: The path of the directory to which the reference PDFs should be downloaded
Usage:
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
Return type: None
Information Provided: Downloads all the referenced PDFs to the specified directory.
3. Linkrot downloader functions
Import:
from linkrot.downloader import sanitize_url, get_status_code, check_refs
sanitize_url(url)
Arguments:
url: The url to be sanitized.
Usage:
new_url = sanitize_url(old_url)
Return type: String <class 'str'>
Information Provided: The URL is prefixed with 'http://' if it had no scheme, and it is ensured to be valid UTF-8.
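The described behaviour can be sketched as a simplified re-implementation (illustrative only; the library's actual code may differ):

```python
def sanitize_url(url: str) -> str:
    # Prefix a scheme if none is present, and force the result
    # to be valid UTF-8 text, as described above.
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url.encode("utf-8", errors="ignore").decode("utf-8")

print(sanitize_url("example.com/paper.pdf"))  # http://example.com/paper.pdf
```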
get_status_code(url)
Arguments:
url: The url to be checked for its status.
Usage:
status_code = get_status_code(url)
Return type: String <class 'str'>
Information Provided: The status code that indicates whether the URL is active or broken.
check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)
Arguments:
refs: Set of linkrot.backends.Reference objects
verbose: Whether to print every reference with its status code, or only a summary of the link check
max_threads: Number of threads for multithreading
Usage:
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
Return type: None
Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
4. Linkrot extractor functions
Import:
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
Get pdf text:
text = pdf.get_text() #pdf is the instance of the linkrot class
extract_urls(text)
Arguments:
text: String of text to extract urls from
Usage:
urls = extract_urls(text)
Return type: Set <class 'set'>
of URLs
Information Provided: All URLs in the text
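A rough equivalent of such an extractor, using an illustrative regex rather than linkrot's actual pattern:

```python
import re

# Illustrative pattern: match http(s) URLs up to the next whitespace
# or closing bracket. The library's own pattern may be stricter.
URL_RE = re.compile(r"https?://[^\s)\]]+")

def find_urls(text: str) -> set:
    return set(URL_RE.findall(text))

sample = "See https://example.com/a.pdf and http://example.org for details."
print(sorted(find_urls(sample)))
```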
extract_arxiv(text)
Arguments:
text: String of text to extract arXiv IDs from
Usage:
arxiv = extract_arxiv(text)
Return type: Set <class 'set'>
of arXiv IDs
Information Provided: All arXiv IDs in the text
extract_doi(text)
Arguments:
text: String of text to extract DOIs from
Usage:
doi = extract_doi(text)
Return type: Set <class 'set'>
of DOIs
Information Provided: All DOIs in the text
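Both DOI and arXiv extraction can be approximated with simple patterns (illustrative regexes only; the library's own may be stricter):

```python
import re

# Illustrative patterns, not linkrot's actual ones.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s]+")
ARXIV_RE = re.compile(r"arxiv:\s?(\d{4}\.\d{4,5})", re.IGNORECASE)

sample = "See doi 10.1000/xyz123 and arXiv:2101.00001 for details."
print(DOI_RE.findall(sample))    # ['10.1000/xyz123']
print(ARXIV_RE.findall(sample))  # ['2101.00001']
```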
Code of Conduct
To view our code of conduct please visit our Code of Conduct page.
License
This program is licensed under the GPLv3 License.
File details
Details for the file linkrot-5.2.1.tar.gz.
File metadata
- Download URL: linkrot-5.2.1.tar.gz
- Upload date:
- Size: 27.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6544d6f6004547ba12f03476f7931082d3241f12d81de899bf0e3a3ead1459ff
MD5 | 6607fd02bfc5cfd7697212f7eb6d0a7a
BLAKE2b-256 | fb869d76739582bc8a0f2e5e2f0400c086ee89bfa4917d0206e29339aceef190
File details
Details for the file linkrot-5.2.1-py3-none-any.whl.
File metadata
- Download URL: linkrot-5.2.1-py3-none-any.whl
- Upload date:
- Size: 30.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 68502965636ac2e2e5e7d11b3b05084fe3568c73a78878d28cb59f39861791b3
MD5 | c5175bde7889520b1669fc418f5414e1
BLAKE2b-256 | 452adaeb932e2c36e890635898f5ac0c8840bea0fa6e42c179c96208b7d1294f