Extract metadata and URLs from PDF files, and download all referenced PDFs
Project description
PDFx
Introduction
Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.
Features
- Extract references and metadata from a given PDF
- Detects pdf, url, arxiv and doi references
- Fast, parallel download of all referenced PDFs
- Find broken hyperlinks (using the
-c
flag) (more) - Output as text or JSON (using the
-j
flag) - Extract the PDF text (using the
--text
flag) - Use as command-line tool or Python package
- Compatible with Python 2 and 3
- Works with local and online pdfs
Getting Started
Grab a copy of the code with easy_install
or pip
, and run it:
$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>
Run pdfx -h
to see the help output:
$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
[--version]
pdf
Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.
positional arguments:
pdf Filename or URL of a PDF file
optional arguments:
-h, --help show this help message and exit
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
Download all referenced PDFs into specified directory
-c, --check-links Check for broken links
-j, --json Output infos as JSON (instead of plain text)
-v, --verbose Print all references (instead of only PDFs)
-t, --text Only extract text (no metadata or references)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Output to specified file instead of console
--version show program's version number and exit
Examples
Lets take a look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf:
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}
References: 36
- URL: 18
- PDF: 18
PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf
You can use the -v
flag to output all references instead of just the
PDFs.
Download all referenced pdfs with -d
(for download-pdfs
) to the
specified directory (eg. to /tmp/
):
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...
To extract text, you can use the -t
flag:
# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t
# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt
To check for broken links use the -c
flag:
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c
[Example (with video) of checking for broken links](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/).
Usage as Python library
>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")
Dev & Contributing
# Setup venv
python3 -m venv
venv . venv/bin/activate
# Install PDFx and dev deps
pip install -e .
pip install -r requirements_dev.txt
# Run tests and checks
make test
make lint
make check
# Format the code (with black)
make format
Releasing
- Update version number in
setup.py
andpdfx/__init__.py
- Create a git tag starting with
v
(eg.git tag v1.5.9
) - Push the tag to GitHub:
git push --tags
GitHub Actions is then publishing to PyPI.
Various
- Author: Chris Hager twitter.com/metachris
- Homepage: https://www.metachris.com/pdfx
- License: Apache
Feedback, ideas and pull requests are welcome!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pdfx-1.4.1.tar.gz
.
File metadata
- Download URL: pdfx-1.4.1.tar.gz
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9f2e71e58c42a76f1900d79e9357e4fcee8dd4358086c94e6b95e1fb1f5aa434 |
|
MD5 | 94550e11587498b5fe3c19110890402b |
|
BLAKE2b-256 | 4522d0cb33e87a2e0833942fe751493e4f6a9254578bb54bb166fc684c2b8105 |
File details
Details for the file pdfx-1.4.1-py2.py3-none-any.whl
.
File metadata
- Download URL: pdfx-1.4.1-py2.py3-none-any.whl
- Upload date:
- Size: 21.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f4a1cd631a105150f4938428b836d9e5102fbf77504e349f05404ad9cbb26e38 |
|
MD5 | c6c6d57ae650cb3a82e606e14324c08d |
|
BLAKE2b-256 | 4b40dad6848bf9886be5b3a297a57401f5e0ab04e0b81e158bbe74db62f1d377 |