PDFSyntax
A Python PDF parsing library, and a tool built on top of it to browse the internal structure of a PDF file
Introduction
The project is focused on chapter 7 ("Syntax") of the Portable Document Format (PDF) Specification.
PDFSyntax is lightweight (no dependencies) and written from scratch in pure Python.
- CLI: It started as a command-line interface to inspect the internal structure of a PDF file.
- API: Now the internal functions are being exposed as a toolkit for PDF read/write operations.
Project status
WORK IN PROGRESS! This is ALPHA quality software. The API may change at any time.
Design
PDFSyntax favors non-destructive edits allowed by the PDF Specification: by default incremental updates are added at the end of the original file.
It is mostly made of simple functions working on built-in types and named tuples. Pure functions return a shallow copy of the Doc object structure instead of modifying it in place, which offers a form of (experimental) immutability.
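The idea of pure functions working on named tuples can be illustrated with a minimal sketch. ToyDoc and with_update are hypothetical names invented for this illustration; they are not part of PDFSyntax and do not reflect the actual layout of the Doc object.

```python
from typing import NamedTuple


class ToyDoc(NamedTuple):
    """Hypothetical stand-in for a document state (NOT the pdfsyntax Doc)."""
    revisions: tuple  # one dict {object number: value} per revision


def with_update(doc: ToyDoc, num: int, value: str) -> ToyDoc:
    """Return a new ToyDoc whose latest revision maps num to value."""
    current = dict(doc.revisions[-1])  # shallow copy of the latest revision only
    current[num] = value
    # _replace builds a new named tuple; the input is left untouched
    return doc._replace(revisions=doc.revisions[:-1] + (current,))


original = ToyDoc(revisions=({1: "Catalog", 2: "Pages"},))
updated = with_update(original, 2, "Pages (rotated)")
print(original.revisions[-1][2])
print(updated.revisions[-1][2])
```

Because only the modified part is copied, the caller keeps a usable reference to the previous state at a low cost.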
Installation
You can install from PyPI:
pip install pdfsyntax
CLI overview
Please refer to the CLI README for details.
The general form of the CLI usage is:
python3 -m pdfsyntax COMMAND FILE
You can get quick insights on a PDF file with these commands:
- overview: outputs text data about the structure and the metadata.
- inspect: outputs static HTML data that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.
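If you prefer to drive the CLI from a script, the general form above can be wrapped as follows. This is only a sketch: pdfsyntax_command and save_inspection are hypothetical helpers, and it assumes that the inspect command prints its HTML to standard output, which the CLI README should confirm.

```python
import subprocess
import sys


def pdfsyntax_command(command: str, pdf_path: str) -> list:
    """Build the argument list for `python -m pdfsyntax COMMAND FILE`."""
    return [sys.executable, "-m", "pdfsyntax", command, pdf_path]


def save_inspection(pdf_path: str, html_path: str) -> None:
    """Run `inspect` and save whatever it prints.

    Assumption: the HTML is written to standard output.
    """
    result = subprocess.run(
        pdfsyntax_command("inspect", pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    with open(html_path, "wb") as out:
        out.write(result.stdout)
```

Calling save_inspection("file.pdf", "file.html") would then produce a page you can open in a browser, provided the assumption about standard output holds.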
API overview
Please refer to the API README for details.
PDFSyntax is mostly made of simple functions. Example:
>>> from pdfsyntax import read, metadata
>>> doc = read("samples/simple_text_string.pdf")
>>> metadata(doc) #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...
The Doc object is probably the only dedicated class you will need to handle. It is a black box that stores all the internal states of a document:
- content that is cached/memoized from an original file,
- modifications that add/modify/delete content and that are tracked as incremental updates.
>>> doc
<PDF Doc with 1 revisions(s), ready to start update/revision 2, cache loaded with 0 / 7 objects>
The same metadata function is also exposed as a method of this object, so you can get the same result with:
>>> doc.metadata() #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...
Low-level functions like get_object or update_object allow you to directly access and manipulate the inner objects of the document structure.
You may also use higher-level functions like rotate:
>>> from pdfsyntax import rotate, write
>>> doc180 = rotate(doc, 180) #rotate pages by 180°
The original object is unchanged and a new object is created with an incremental update (revision 2) that contains the pending orientation change:
>>> doc180
<PDF Doc with 2 revisions(s), current update/revision containing 1 modifications, cache loaded with 3 / 7 objects>
You can then write the modified PDF to disk. Note that the resulting file contains a new section appended to the original content. You may cut this section off to revert the change.
>>> write(doc180, "rotated_doc.pdf")
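Cutting the appended section can be sketched in a few lines of plain Python. In the PDF format each revision, including each incremental update, ends with its own %%EOF marker, so truncating the file after the second-to-last marker drops the latest update. drop_last_revision is a hypothetical helper, not a PDFSyntax function, and it assumes that %%EOF never appears inside stream data and that the original file is not linearized (such a file can contain more than one marker on its own).

```python
EOF_MARKER = b"%%EOF"


def drop_last_revision(data: bytes) -> bytes:
    """Return data truncated just after the second-to-last %%EOF marker.

    The data is returned unchanged if fewer than two markers are found.
    """
    last = data.rfind(EOF_MARKER)
    if last == -1:
        return data
    previous = data.rfind(EOF_MARKER, 0, last)
    if previous == -1:
        return data  # a single revision: nothing to revert
    end = previous + len(EOF_MARKER)
    # keep the end-of-line sequence that follows the marker, if any
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\n", b"\r"):
        end += 1
    return data[:end]


# Synthetic, simplified byte strings standing in for real files
revision1 = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n"
revision2 = revision1 + b"1 0 obj\n<< /Rotate 180 >>\nendobj\nstartxref\n60\n%%EOF\n"
print(drop_last_revision(revision2) == revision1)
```

When the original file is still available, truncating the new file to the original's length is a simpler and safer way to get the same result.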
Open-Source, not Open-Contribution yet
PDFSyntax is MIT licensed but is currently closed to contributions.
Personal note: this is a pet project of mine and my time is limited. First I need to focus on my roadmap (new features and refactoring), and then I will happily accept contributions once everything is a little more stabilised.