A Python library to inspect and modify the internal structure of a PDF file

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Software Development :: Libraries
- Utilities

Project description

PDFSyntax

Introduction

The project is focused on chapter 7 ("Syntax") of the Portable Document Format (PDF) Specification.

PDFSyntax is lightweight (no dependencies) and written from scratch in pure Python.

CLI: It started as a command-line interface to inspect the internal structure of a PDF file.
API: Now the internal functions are being exposed as a toolkit for PDF read/write operations.

Project status

WORK IN PROGRESS! This is ALPHA quality software. The API may change at any time. Next on the to-do list:

- Cut & append pages
- Lossless compression
- More filters
- Improve text extraction
- Augment text extraction with layout detection

Design

PDFSyntax favors non-destructive edits allowed by the PDF Specification: by default incremental updates are added at the end of the original file.
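To illustrate what an incremental update looks like at the byte level, here is a minimal sketch that does not use PDFSyntax itself. It relies on the fact that each revision of a PDF file ends with an `%%EOF` marker, so counting the markers gives a rough revision count. The function name `count_revisions` and the toy byte strings are assumptions made for this illustration, and the heuristic can be fooled (for example by an `%%EOF` sequence inside binary stream data).

```python
EOF_MARKER = b"%%EOF"


def count_revisions(data: bytes) -> int:
    """Roughly count revisions by counting %%EOF markers (heuristic only)."""
    return data.count(EOF_MARKER)


# Toy stand-ins for real file content (these are not valid PDFs).
original = b"%PDF-1.7\n...original body...\nstartxref\n123\n%%EOF\n"
update = b"...changed objects...\nstartxref\n456\n%%EOF\n"

# An incremental update is appended; the original bytes are left untouched.
updated = original + update

print(count_revisions(original))
print(count_revisions(updated))
print(updated.startswith(original))
```

Because the original bytes remain a prefix of the updated file, the earlier revision stays recoverable.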

It is mostly made of simple functions working on built-in types and named tuples. Pure functions return shallow copies of the Doc object structure instead of modifying it in place, which offers a kind of (experimental) immutability.
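The general pattern of pure functions returning shallow copies of a named tuple can be sketched as follows. This is not PDFSyntax's actual code: `ToyDoc`, its fields, and `add_revision` are hypothetical names chosen only to show the idea with the standard library.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a document structure.
ToyDoc = namedtuple("ToyDoc", ["revisions", "cache"])


def add_revision(doc, changes):
    """Pure function: return a new ToyDoc holding one more revision."""
    # _replace builds a shallow copy; untouched fields (cache) are shared.
    return doc._replace(revisions=doc.revisions + (changes,))


doc1 = ToyDoc(revisions=({},), cache={})
doc2 = add_revision(doc1, {"Rotate": 180})

print(len(doc1.revisions))       # the first object keeps its single revision
print(len(doc2.revisions))       # the new object has one more
print(doc2.cache is doc1.cache)  # the shallow copy shares the same cache
```

Since the copy is shallow, mutable fields such as the cache in this sketch are shared between the old and new objects, so this kind of immutability is only partial.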

Installation

You can install from PyPI:

pip install pdfsyntax

CLI overview

Please refer to the CLI README for details.

The general form of the CLI usage is:

python3 -m pdfsyntax COMMAND FILE

You can get quick insights on a PDF file with these commands:

- `overview` outputs text data about the structure and the metadata.
- `browse` outputs static HTML that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.
- `text` outputs extracted text laid out spatially, as if it were a kind of scan.
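If you want to drive the CLI from a script, the general form above can be wrapped with the standard `subprocess` module. This is a minimal sketch: the helper names `build_cli_args` and `run_cli` are made up for this example, and it assumes the CLI writes its result to standard output and that `pdfsyntax` is installed for the running interpreter.

```python
import subprocess
import sys


def build_cli_args(command: str, pdf_path: str) -> list:
    """Build the argument list for: python3 -m pdfsyntax COMMAND FILE."""
    # sys.executable stands in for "python3" so the same interpreter is used.
    return [sys.executable, "-m", "pdfsyntax", command, pdf_path]


def run_cli(command: str, pdf_path: str) -> str:
    """Run the CLI and return what it printed (assumed to go to stdout)."""
    result = subprocess.run(
        build_cli_args(command, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Only show the command line here; call run_cli(...) to actually execute it.
print(" ".join(build_cli_args("overview", "samples/simple_text_string.pdf")[1:]))
```

For instance, `run_cli("browse", "file.pdf")` would return the generated HTML as a string under the assumptions stated above.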

API overview

Please refer to the API README for details.

PDFSyntax is mostly made of simple functions. Example:

>>> from pdfsyntax import readfile, metadata
>>> doc = readfile("samples/simple_text_string.pdf")
>>> metadata(doc) #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...

The Doc object is probably the only dedicated class you will need to handle. It is a black box that stores all the internal states of a document:

- content that is cached/memoized from an original file,
- modifications that add/modify/delete content and that are tracked as incremental updates.

>>> doc
<PDF Doc with 1 revisions(s), ready to start update/revision 2, cache loaded with 0 / 7 objects>

The Doc object also exposes the metadata function as a method, so you can get the same result with:

>>> doc.metadata() #returns a Python dict whose keys are 'Title', 'Author', 'Subject', etc...

Low-level functions like get_object or update_object allow you to directly access and manipulate the inner objects of the document structure. You may also use higher-level functions like rotate:

>>> from pdfsyntax import rotate, writefile
>>> doc180 = rotate(doc, 180) #rotate pages by 180°

The original object is unchanged and a new object is created with an incremental update (revision 2) that holds the pending orientation change:

>>> doc180
<PDF Doc with 2 revisions(s), current update/revision containing 1 modifications, cache loaded with 3 / 7 objects>

You can then write the modified PDF to disk. Note that the resulting file contains a new section appended to the original content. You may cut this section off to revert the change.

>>> writefile(doc180, "rotated_doc.pdf")
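Cutting the appended section to revert a change can be sketched on raw bytes, without PDFSyntax. This is a heuristic illustration only: `strip_last_revision` is a hypothetical helper, the byte strings are toy data rather than valid PDFs, and the code assumes that every revision ends with an `%%EOF` marker that does not also occur elsewhere (for example inside stream data).

```python
EOF_MARKER = b"%%EOF"


def strip_last_revision(data: bytes) -> bytes:
    """Return data without its last appended revision (heuristic)."""
    last = data.rfind(EOF_MARKER)
    if last == -1:
        raise ValueError("no %%EOF marker found")
    previous = data.rfind(EOF_MARKER, 0, last)
    if previous == -1:
        raise ValueError("only one revision found; nothing to cut")
    end = previous + len(EOF_MARKER)
    # Keep the single end-of-line sequence that follows the marker, if any.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return data[:end]


# Toy stand-ins for an original file and its appended update.
original = b"%PDF-1.7\n...original body...\n%%EOF\n"
updated = original + b"...rotated page object...\n%%EOF\n"

print(strip_last_revision(updated) == original)
```

On a real file you would read the bytes (for example with `pathlib.Path("rotated_doc.pdf").read_bytes()`), apply the function, and write the result to a new file so the input is preserved.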

Open-Source, not Open-Contribution yet

PDFSyntax is MIT licensed but is currently closed to contributions.

Personal note: this is a pet project of mine and my time is limited. First I need to focus on my roadmap (new features and refactoring), and then I will happily accept contributions once everything is a little more stabilised.

Release history

- 0.1.1 (this version): May 11, 2024
- 0.1.0: Apr 14, 2024
- 0.0.8: Jan 21, 2024
- 0.0.7: Sep 2, 2023
- 0.0.6: Apr 29, 2023
- 0.0.5: Nov 18, 2022
- 0.0.4: Sep 24, 2022
- 0.0.3: Sep 17, 2022
- 0.0.2: Jul 28, 2021
- 0.0.1: Jul 17, 2021

Download files

Source Distribution

pdfsyntax-0.1.1.tar.gz (38.7 kB), uploaded May 11, 2024

Built Distribution

pdfsyntax-0.1.1-py3-none-any.whl (38.0 kB, Python 3), uploaded May 11, 2024

Hashes for pdfsyntax-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`93dcfc4e059ccb06a00665cf784f163fa687942dd3067b3e63ef20983952d7d4`
MD5	`af4dea9b91a98768eba15fdaa42b7b8f`
BLAKE2b-256	`2736840b1a2ae7ae02fcba4517752decc8a4fcc913e39b3583683f1647bb498f`

Hashes for pdfsyntax-0.1.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`893bacb39f109278a4811093ae6742a1d0a0b0c6ecf951c984f5977d3419a083`
MD5	`4f52773088824dd0b4d239b971250220`
BLAKE2b-256	`0d09bd9db0432921680fc5403a041377e56fc0c2875d64fb61ba8890ada6f842`