A modern pure-Python library for reading PDF files
The goal is a modern interface for handling PDF files that is self-consistent and follows typical Python conventions.
The library should be Python-only (no C extensions) but allow the backend to be swapped, similar in concept to matplotlib backends and Keras backends.
The default backend could be PyPDF2.
Other possible backends are PyMuPDF (using MuPDF) and PikePDF (using QPDF).
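To make the swappable-backend idea concrete, here is a minimal sketch of how such a design could look. Every name in it (Backend, InMemoryBackend, Document) is hypothetical and chosen for illustration only; it does not describe how pdffile is actually implemented.

```python
from typing import List, Protocol


class Backend(Protocol):
    """Hypothetical contract that every backend would have to satisfy."""

    def page_count(self) -> int: ...

    def page_text(self, index: int) -> str: ...


class InMemoryBackend:
    """Toy backend holding page texts in a list (stands in for PyPDF2 etc.)."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    def page_count(self) -> int:
        return len(self._pages)

    def page_text(self, index: int) -> str:
        return self._pages[index]


class Document:
    """Front end exposing a Pythonic interface on top of any backend."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def __len__(self) -> int:
        return self._backend.page_count()

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError("page index out of range")
        return self._backend.page_text(index)

    @property
    def text(self) -> str:
        # Concatenate the text of all pages, one page per line.
        return "\n".join(self[i] for i in range(len(self)))


doc = Document(InMemoryBackend(["first page", "second page"]))
print(len(doc), doc[1])
```

Because the front end only talks to the small Backend contract, a different PDF engine could be plugged in without changing user code.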
WARNING: This library is UNSTABLE at the moment! Expect many changes!
Installation
pip install pdffile
Usage
Retrieve Metadata
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1
>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
        '/CreationDate': "D:20220403180542+02'00'",
        '/ModDate': "D:20220403180542+02'00'",
        '/Trapped': '/False',
        '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
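The raw '/CreationDate' and '/ModDate' entries use the PDF date-string format, while creation_date and modification_date are datetime objects. How the library performs that conversion is not shown here; the following is a minimal sketch of one way to parse such a string with the standard library. parse_pdf_date is a hypothetical helper, not part of pdffile, and unlike the output above it keeps the UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# "D:" + YYYY, then optional MM DD HH mm SS, then an optional offset
# such as Z, +02'00' or -0500.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz]|[+-]\d{2}'?(?:\d{2}'?)?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string (hypothetical helper, for illustration)."""
    match = _PDF_DATE.match(raw)
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    parts = match.groupdict()
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),   # missing fields fall back to defaults
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    tz = parts["tz"]
    if tz is None:
        return naive                # no offset given: return a naive datetime
    if tz in ("Z", "z"):
        return naive.replace(tzinfo=timezone.utc)
    sign = 1 if tz[0] == "+" else -1
    digits = re.sub(r"\D", "", tz)  # "+02'00'" -> "0200"
    offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4] or 0))
    return naive.replace(tzinfo=timezone(sign * offset))


parsed = parse_pdf_date("D:20220403180542+02'00'")
print(parsed.isoformat())
```

Calling parsed.replace(tzinfo=None) drops the offset and gives a naive datetime like the one in the metadata output above.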
Encrypted PDFs
If you have an encrypted PDF, provide the password:
doc = pdf.PdfFile(pdf_path, password=password)
All subsequent operations work exactly as they do for unencrypted files.
Get Outline
>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
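The outline comes back as a flat list, so a caller may want to render it as an indented table of contents. The sketch below is one possible approach: it guesses the nesting depth from the section number at the start of each title. Links is redefined here as a stand-in namedtuple so the example runs on its own; format_outline is a hypothetical helper, and whether page is zero-based or one-based is not stated in this README.

```python
from collections import namedtuple

# Stand-in mirroring the repr shown above; the real class comes from the library.
Links = namedtuple("Links", ["page", "text"])


def format_outline(entries):
    """Render outline entries as an indented table of contents."""
    lines = []
    for entry in entries:
        number = entry.text.split(" ", 1)[0]
        # "1.1" -> depth 1, "1" -> depth 0; unnumbered titles stay at depth 0.
        is_numbered = number.replace(".", "").isdigit()
        depth = number.count(".") if is_numbered else 0
        lines.append(f"{'  ' * depth}{entry.text} (page {entry.page})")
    return "\n".join(lines)


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
print(format_outline(outline))
```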
Extract Text
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'
Alternatively, you can use doc.text to get the text of all pages.
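As the output above shows, extracted text can lose the spaces between words, which makes a plain substring search unreliable. A minimal sketch of a whitespace-insensitive page search follows. It relies only on len(doc), doc[i], and .text as shown in this README; pages_containing is a hypothetical helper, and fake_doc is a stand-in so the example runs without a PDF file.

```python
from types import SimpleNamespace


def pages_containing(doc, phrase):
    """Return zero-based indices of pages whose text contains `phrase`.

    Whitespace and letter case are ignored on both sides of the comparison.
    """
    needle = "".join(phrase.split()).lower()
    hits = []
    for index in range(len(doc)):
        haystack = "".join(doc[index].text.split()).lower()
        if needle in haystack:
            hits.append(index)
    return hits


# Stand-in for a PdfFile: any sequence of objects with a .text attribute works.
fake_doc = [
    SimpleNamespace(text="Loremipsumdolorsitamet,\nconsetetursadipscingelitr"),
    SimpleNamespace(text="1\n"),
]
print(pages_containing(fake_doc, "lorem ipsum"))
```

With a real document, the same call would be pages_containing(doc, "lorem ipsum").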