Skip to main content

Extract metadata and URLs from PDF files, and download all referenced PDFs

Project description

PDFx

Build status for master branch image image

Introduction

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Features

  • Extract references and metadata from a given PDF
  • Detects pdf, url, arxiv and doi references
  • Fast, parallel download of all referenced PDFs
  • Find broken hyperlinks (using the -c flag) (more)
  • Output as text or JSON (using the -j flag)
  • Extract the PDF text (using the --text flag)
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3
  • Works with local and online pdfs

Getting Started

Grab a copy of the code with easy_install or pip, and run it:

$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit

Examples

Lets take a look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36
- URL: 18
- PDF: 18

PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf

You can use the -v flag to output all references instead of just the PDFs.

Download all referenced pdfs with -d (for download-pdfs) to the specified directory (eg. to /tmp/):

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...

To extract text, you can use the -t flag:

# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t

# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To check for broken links use the -c flag:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c

[Example (with video) of checking for broken links](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/).

Usage as Python library

>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")

Dev & Contributing

# Setup venv
python3 -m venv
venv . venv/bin/activate

# Install PDFx and dev deps
pip install -e .
pip install -r requirements_dev.txt

# Run tests and checks
make test
make lint
make check

# Format the code (with black)
make format

Releasing

  • Update version number in setup.py and pdfx/__init__.py
  • Create a git tag starting with v (eg. git tag v1.5.9)
  • Push the tag to GitHub: git push --tags

GitHub Actions is then publishing to PyPI.

Various

Feedback, ideas and pull requests are welcome!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdfx-1.4.1.tar.gz (1.5 MB view details)

Uploaded Source

Built Distribution

pdfx-1.4.1-py2.py3-none-any.whl (21.2 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file pdfx-1.4.1.tar.gz.

File metadata

  • Download URL: pdfx-1.4.1.tar.gz
  • Upload date:
  • Size: 1.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for pdfx-1.4.1.tar.gz
Algorithm Hash digest
SHA256 9f2e71e58c42a76f1900d79e9357e4fcee8dd4358086c94e6b95e1fb1f5aa434
MD5 94550e11587498b5fe3c19110890402b
BLAKE2b-256 4522d0cb33e87a2e0833942fe751493e4f6a9254578bb54bb166fc684c2b8105

See more details on using hashes here.

File details

Details for the file pdfx-1.4.1-py2.py3-none-any.whl.

File metadata

  • Download URL: pdfx-1.4.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 21.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for pdfx-1.4.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 f4a1cd631a105150f4938428b836d9e5102fbf77504e349f05404ad9cbb26e38
MD5 c6c6d57ae650cb3a82e606e14324c08d
BLAKE2b-256 4b40dad6848bf9886be5b3a297a57401f5e0ab04e0b81e158bbe74db62f1d377

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page