A modern pure-Python library for reading PDF files
The goal is a modern interface for handling PDF files that is self-consistent and follows typical Python conventions.
The library should be Python-only (hence no C extensions) but allow the backend to be swapped, similar in concept to matplotlib and Keras backends.
The default backend could be PyPDF2.
Other possible backends include PyMuPDF (using MuPDF) and pikepdf (using QPDF).
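To make the backend idea concrete, here is a minimal sketch of how a swappable backend could be wired up. It is not pdffile's actual implementation: the names Backend, InMemoryBackend and Document are hypothetical and exist only for this illustration, and the toy backend stands in for a real one such as PyPDF2.

```python
from __future__ import annotations

from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface a backend would implement (not pdffile's API)."""

    def page_count(self, path: str) -> int: ...

    def page_text(self, path: str, index: int) -> str: ...


class InMemoryBackend:
    """Toy backend for illustration; a real one would wrap a PDF parser."""

    def __init__(self, pages: dict[str, list[str]]) -> None:
        self._pages = pages

    def page_count(self, path: str) -> int:
        return len(self._pages[path])

    def page_text(self, path: str, index: int) -> str:
        return self._pages[path][index]


class Document:
    """Front end whose interface stays the same whichever backend is used."""

    def __init__(self, path: str, backend: Backend) -> None:
        self._path = path
        self._backend = backend

    def __len__(self) -> int:
        return self._backend.page_count(self._path)

    def __getitem__(self, index: int) -> str:
        return self._backend.page_text(self._path, index)


# Swapping the backend only changes this line, not the calling code below.
backend = InMemoryBackend({"demo.pdf": ["first page", "second page"]})
doc = Document("demo.pdf", backend)
print(len(doc), doc[1])
```

Because the front end only talks to the two methods of the interface, a different parser can be substituted without touching user code, which is the property the matplotlib and Keras comparison refers to.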
WARNING: This library is UNSTABLE at the moment! Expect many changes!
Installation
pip install pdffile
Usage
Retrieve Metadata
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1
>>> doc.metadata
{'/Producer': 'pdfTeX-1.40.23', '/Creator': 'TeX', '/CreationDate': "D:20220403180542+02'00'", '/ModDate': "D:20220403180542+02'00'", '/Trapped': '/False', '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.23 (TeX Live 2021) kpathsea version 6.3.3'}
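The /CreationDate and /ModDate values above are raw PDF date strings. As a rough sketch, they can be turned into timezone-aware datetime objects with the standard library alone. The helper parse_pdf_date is a hypothetical name, not part of pdffile, and it only handles the full form with an explicit UTC offset seen in this output; the PDF format also allows shorter forms and a "Z" suffix, which this sketch rejects.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only the full form, e.g. D:20220403180542+02'00'
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a full PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign = 1 if match.group(7) == "+" else -1
    offset = sign * timedelta(
        hours=int(match.group(8)), minutes=int(match.group(9))
    )
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


created = parse_pdf_date("D:20220403180542+02'00'")
print(created.isoformat())  # 2022-04-03T18:05:42+02:00
```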
Extract Text
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'
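To collect the text of a whole document, one option is a small helper that relies only on the operations shown above: len(doc), doc[i] and the page's text attribute. The helper all_text is a hypothetical name, not part of pdffile. Whether PdfFile supports direct iteration is not shown here, so the sketch loops over indices instead. FakePage and FakeDoc are stand-ins that let the example run without a PDF file.

```python
def all_text(doc) -> str:
    """Concatenate the text of every page using only len() and indexing."""
    return "".join(doc[i].text for i in range(len(doc)))


class FakePage:
    """Stand-in for a page object exposing a .text attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


class FakeDoc:
    """Stand-in for a document supporting len() and integer indexing."""

    def __init__(self, texts: list) -> None:
        self._pages = [FakePage(t) for t in texts]

    def __len__(self) -> int:
        return len(self._pages)

    def __getitem__(self, index: int) -> FakePage:
        return self._pages[index]


combined = all_text(FakeDoc(["page one\n", "page two\n"]))
print(combined)
```

With the real library, the same helper would presumably be called as all_text(pdf.PdfFile("001-trivial/minimal-document.pdf")).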