
A small example package


linkrot

Introduction

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Features

  • Extract references and metadata from a given PDF.
  • Detect pdf, url, arxiv and doi references.
  • Find broken hyperlinks (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use as a command-line tool or as a Python package.
  • Works with local and online PDFs.

Getting Started

Grab a copy of the code with snap or pip, and run it:

snap install linkrot

pip install -e git+https://github.com/marshalmiller/linkrot.git#egg=linkrot ...

$ linkrot

Run linkrot -h to see the help output:

$ linkrot -h
usage: linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit

Examples

Let's take a look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf:

$ linkrot https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:

  • CreationDate = D:20150821110623-04'00'
  • Creator = LaTeX with hyperref package
  • ModDate = D:20150821110805-04'00'
  • PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
  • Pages = 13
  • Producer = pdfTeX-1.40.14
  • Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
  • Trapped = False
  • dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
  • pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
  • linkrot = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
  • xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
  • xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36

  • URL: 18
  • PDF: 18

PDF References:

You can use the -v flag to output all references instead of just the PDFs.
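The per-type counts in the summary above can also be computed from Python. The following is a minimal sketch, assuming the references are available as a mapping from reference type to a list of references (the shape that get_references_as_dict() is assumed to return here; this is not confirmed by this page). The helper name count_references and the sample data are hypothetical.

```python
from collections import Counter


def count_references(references_by_type):
    """Return (total, per-type counts) for a {type: [reference, ...]} mapping."""
    counts = Counter(
        {ref_type: len(refs) for ref_type, refs in references_by_type.items()}
    )
    return sum(counts.values()), counts


# Hypothetical sample data; real reference lists would come from a PDF.
sample = {
    "url": ["https://example.org/a", "https://example.org/b"],
    "pdf": ["https://example.org/paper.pdf"],
}

total, counts = count_references(sample)
print(f"References: {total}")
for ref_type, number in counts.most_common():
    print(f"  {ref_type.upper()}: {number}")
```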

Download all referenced PDFs with -d (for download-pdfs) to the specified directory (e.g. to /tmp/):

$ linkrot https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...

To extract text, you can use the -t flag:

# Extract text to console
$ linkrot https://weakdh.org/imperfect-forward-secrecy.pdf -t

# Extract text to file
$ linkrot https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To check for broken links use the -c flag:

# Check links
$ linkrot https://weakdh.org/imperfect-forward-secrecy.pdf -c
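To illustrate what a broken-link check involves in general, here is a minimal sketch using only the Python standard library. It is not linkrot's implementation, which this page does not describe. The function names check_link and broken_links, the choice of HEAD requests, and the "status >= 400 means broken" rule are all assumptions made for the example.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return (url, status), where status is an HTTP code or an error string."""
    try:
        # HEAD avoids downloading the body; this is an assumption of the sketch.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return url, err.code
    except (urllib.error.URLError, OSError, ValueError) as err:
        # DNS failures, refused connections, timeouts, malformed URLs.
        return url, f"error: {err}"


def broken_links(urls, **kwargs):
    """Keep only the links that errored or answered with status >= 400."""
    results = (check_link(url, **kwargs) for url in urls)
    return [
        (url, status)
        for url, status in results
        if not isinstance(status, int) or status >= 400
    ]
```

Calling broken_links(list_of_urls) returns the failing URLs together with their status. Some servers reject HEAD requests, so a fuller checker would probably retry with GET before reporting a link as broken.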

Usage as Python library

import linkrot
pdf = linkrot.linkrot("filename-or-url.pdf")
metadata = pdf.get_metadata()
references_list = pdf.get_references()
references_dict = pdf.get_references_as_dict()
pdf.download_pdfs("target-directory")

Built Distribution

linkrot-0.0.1-py3-none-any.whl (5.9 kB)
