Skip to main content

Extract metadata and URLs from PDF files

Project description

linkrot logo

Introduction

Scans PDFs for links written in plaintext and checks if they are active or returns an error code. It then generates a report of its findings. Extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Check out our sister project, Rotting Research, for a web app implementation of this project.

Features

  • Extract references and metadata from a given PDF.
  • Detects PDF, URL, arXiv and DOI references.
  • Archives valid links using Internet Archive's Wayback Machine (using the -a flag).
  • Checks for valid SSL certificate.
  • Find broken hyperlinks (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use as command-line tool or Python package.
  • Works with local and online PDFs.

Installation

Grab a copy of the code with pip:

pip install linkrot

Usage

linkrot can be used to extract info from a PDF in two ways:

  • Command line/Terminal tool linkrot
  • Python library import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help            (Show this help message and exit)  
-d OUTPUT_DIRECTORY,  --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)  
-c, --check-links     (Check for broken links)  
-j, --json            (Output infos as JSON (instead of plain text))  
-v, --verbose         (Print all references (instead of only PDFs))  
-t, --text            (Only extract text (no metadata or references))  
-a, --archive	  (Archive actvice links)
-o OUTPUT_FILE,        --output-file OUTPUT_FILE (Output to specified file instead of console)  
--version             (Show program's version number and exit)  

PDF Samples

For testing purposes, you can find PDF samples in shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).

Examples

Extract text to console.

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

Now the following function can be used to extract specific data from the pdf:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'>

Information Provided: All metadata, secret metadata associated with the PDF including Creation date, Creator, Title, etc...

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String <class 'str'>

Information Provided: The entire content of the PDF in string form.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set <class 'set'> of <linkrot.backends.Reference object>

linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced

Information Provided: All references with their corresponding type and page number.

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary <class 'dict'> with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list <class 'list'> of refs of that type.

Information Provided: All references in their corresponding type list.

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the reference PDFs should be downloaded 

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the reference PDFs to the specified directory.

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The url to be sanitized.

Usage:

new_url = sanitize_url(old_url) 

Return type: String <class 'str'>

Information Provided: URL is prefixed with 'http://' if it was not before and makes sure it is in utf-8 format.

get_status_code(url)

Arguments:

url: The url to be checked for its status. 

Usage:

status_code = get_status_code(url) 

Return type: String <class 'str'>

Information Provided: Checks if the URL is active or broken.

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract urls from

Usage:

urls = extract_urls(text)

Return type: Set <class 'set'> of URLs

Information Provided: All URLs in the text

extract_arxiv(text)

Arguments:

text: String of text to extract arXivs from

Usage:

arxiv = extract_arxiv(text)

Return type: Set <class 'set'> of arxivs

Information Provided: All arXivs in the text

extract_doi(text)

Arguments:

text: String of text to extract DOIs from

Usage:

doi = extract_doi(text)

Return type: Set <class 'set'> of DOIs

Information Provided: All DOIs in the text

Code of Conduct

To view our code of conduct please visit our Code of Conduct page.

License

This program is licensed with an GPLv3 License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

linkrot-5.2.1.tar.gz (27.1 kB view details)

Uploaded Source

Built Distribution

linkrot-5.2.1-py3-none-any.whl (30.1 kB view details)

Uploaded Python 3

File details

Details for the file linkrot-5.2.1.tar.gz.

File metadata

  • Download URL: linkrot-5.2.1.tar.gz
  • Upload date:
  • Size: 27.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for linkrot-5.2.1.tar.gz
Algorithm Hash digest
SHA256 6544d6f6004547ba12f03476f7931082d3241f12d81de899bf0e3a3ead1459ff
MD5 6607fd02bfc5cfd7697212f7eb6d0a7a
BLAKE2b-256 fb869d76739582bc8a0f2e5e2f0400c086ee89bfa4917d0206e29339aceef190

See more details on using hashes here.

File details

Details for the file linkrot-5.2.1-py3-none-any.whl.

File metadata

  • Download URL: linkrot-5.2.1-py3-none-any.whl
  • Upload date:
  • Size: 30.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for linkrot-5.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 68502965636ac2e2e5e7d11b3b05084fe3568c73a78878d28cb59f39861791b3
MD5 c5175bde7889520b1669fc418f5414e1
BLAKE2b-256 452adaeb932e2c36e890635898f5ac0c8840bea0fa6e42c179c96208b7d1294f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page