A library for manipulating PDF content streams.
pdfbeaver
A context-aware PDF content stream editor.
beaver: an animal which manipulates water streams.
pdfbeaver: a library which manipulates PDF content streams.
pdfbeaver bridges the gap between reading PDFs (calculating text positions, tracking graphics state) and writing PDFs (injecting operators, removing content). With pdfbeaver, you can write PDF content stream filters that know "where you are on the page" at any given point in the content stream.
Example applications:
- change colors of PDF text and vector graphics
- redact PDF text content without disrupting the rest of the text
- optimize vector paths in PDF graphics
- replace fonts in a PDF file
It is built on top of pikepdf (and qpdf) for PDF writing/manipulation and pdfminer.six for stream parsing and state tracking.
🚀 Key Features
- User-friendly API: register stream editing methods using decorators.
- Context-Aware Editing: Modify operators based on the current graphics state (Font, Color, Matrix, CTM).
- Safe Recursion: Automatically traverses and modifies Form XObjects, ensuring nested content is treated exactly like page content.
- State Tracking: Tracks the cursor position ($x, y$) and transformation matrices ($Tm, CTM$) as you parse.
- Peephole Optimization: Includes passes to remove dead stores (unused graphics state updates) to keep output files small.
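To make the state-tracking feature concrete, here is a minimal sketch of how a text position on the page can be derived from the text matrix ($Tm$) and the current transformation matrix ($CTM$). It is independent of pdfbeaver's internals: the helper names `concat` and `apply` are illustrative only, and the matrix values are made up. PDF matrices are written as six numbers `[a b c d e f]` and act on row vectors.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1, 0, 0, 1, 0, 0)


def concat(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2: apply m1 first, then m2 (PDF row-vector convention)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Transform a point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Hypothetical state: text matrix set by "1 0 0 1 72 700 Tm",
# CTM set by "2 0 0 2 10 20 cm".
tm: Matrix = (1, 0, 0, 1, 72, 700)
ctm: Matrix = (2, 0, 0, 2, 10, 20)

# The text-space origin is mapped through Tm first, then through the CTM.
text_origin = apply(concat(tm, ctm), (0, 0))
print(text_origin)  # (154, 1420)
```

A tracker that keeps both matrices up to date while walking the stream can answer "where is this operator drawing?" at any point, which is what context-aware handlers rely on.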
📦 Installation
pip install pdfbeaver
(Note: Requires pikepdf and pdfminer.six)
⚡ Quick Start
1. Simple Operator Replacement
Change the color of all text to red.
import pikepdf
import pdfbeaver

pdf = pikepdf.open("input.pdf")

@pdfbeaver.register("Tj", "TJ", "'", '"')
def make_text_red(op, operands, raw_bytes):
    # Return a sequence of instructions:
    # 1. Set RGB color to Red (1, 0, 0)
    # 2. Draw the original text
    return [
        ([1, 0, 0], "rg"),  # Non-stroking red
        ([1, 0, 0], "RG"),  # Stroking red
        raw_bytes,          # Original text op
    ]

pdfbeaver.process(pdf)
pdf.save("output_red.pdf")
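The handler above returns a list that mixes `(operands, operator)` tuples with raw bytes. As a rough illustration of what such a list means at the byte level, the following sketch turns it into content stream syntax. The `serialize` and `format_number` helpers are hypothetical, not pdfbeaver's actual writer. The sketch handles only numeric operands and assumes that raw bytes already contain a complete instruction.

```python
def format_number(value):
    """Format a number without exponent notation, which PDF syntax does not allow."""
    return f"{value:.6f}".rstrip("0").rstrip(".")


def serialize(instructions):
    """Join handler output into content stream bytes, one instruction per line."""
    lines = []
    for item in instructions:
        if isinstance(item, (bytes, bytearray)):
            # Raw bytes are passed through untouched.
            lines.append(bytes(item))
        else:
            operands, operator = item
            tokens = [format_number(v) for v in operands] + [operator]
            lines.append(" ".join(tokens).encode("ascii"))
    return b"\n".join(lines)


result = serialize([
    ([1, 0, 0], "rg"),
    ([1, 0, 0], "RG"),
    b"(Hello) Tj",  # stands in for the raw_bytes of the original instruction
])
print(result)  # b'1 0 0 rg\n1 0 0 RG\n(Hello) Tj'
```

Passing the original bytes through instead of re-encoding them preserves the original text operator exactly.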
2. Context-Aware Modification (Redaction)
Delete text only if it appears in the top-left quadrant of the page.
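@pdfbeaver.register("Tj", "TJ")
def delete_top_left(context):
    x, y = pdfbeaver.extract_text_position(context.pre_input)[:2]
    # PDF's default origin is the bottom-left corner, so "top-left" means
    # small x and large y. The thresholds assume a roughly Letter-sized page.
    if x < 300 and y > 400:
        return None  # Returning None deletes the operator
    return pdfbeaver.UNCHANGED  # Pass through unchanged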
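The thresholds 300 and 400 in the handler above are hard-coded. One way to make the quadrant test independent of page size is to derive the midpoints from the page's bounding box. The helper below is a hypothetical sketch, not part of pdfbeaver. It assumes the box is given as `(llx, lly, urx, ury)`, for example the numbers from the page's /MediaBox, and that `x` and `y` are in the same coordinate system.

```python
def in_top_left_quadrant(x, y, box):
    """Return True if (x, y) lies in the top-left quarter of the box.

    box is (llx, lly, urx, ury) with the origin at the bottom-left,
    so "top" means a y value above the vertical midpoint.
    """
    llx, lly, urx, ury = box
    mid_x = (llx + urx) / 2
    mid_y = (lly + ury) / 2
    return x < mid_x and y > mid_y


letter = (0, 0, 612, 792)  # US Letter in PDF points
print(in_top_left_quadrant(72, 700, letter))   # True
print(in_top_left_quadrant(400, 700, letter))  # False
```

Inside a handler, the result of such a test would decide between returning `None` and returning `pdfbeaver.UNCHANGED`.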
Flexible Signatures
The @register decorator inspects your function signature. You can include any of the following arguments in any order:
- operands (or args): List of arguments for the operator.
- operator (or op): The operator string (e.g. "Tj").
- raw_bytes: The original binary data for this instruction.
- context: The StreamContext object.
- pdf: The pikepdf.Pdf document.
- page: The pikepdf.Page object.
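Signature-based dispatch of this kind can be built on the standard library's `inspect.signature`. The sketch below shows the general technique only; `call_with_requested_args` and the `_ALIASES` table are hypothetical names, and pdfbeaver's real implementation may differ.

```python
import inspect

# Alternative parameter names mapped to their canonical form (per the list above).
_ALIASES = {"args": "operands", "op": "operator"}


def call_with_requested_args(handler, available):
    """Call handler with only the arguments its signature asks for."""
    kwargs = {}
    for name in inspect.signature(handler).parameters:
        key = _ALIASES.get(name, name)
        if key not in available:
            raise TypeError(f"handler requests unknown argument {name!r}")
        kwargs[name] = available[key]
    return handler(**kwargs)


def handler(op, raw_bytes):
    # This handler asks for two of the six possible arguments.
    return (op, raw_bytes)


available = {
    "operands": ["Hello"],
    "operator": "Tj",
    "raw_bytes": b"(Hello) Tj",
    "context": None,  # placeholders; a real run would supply live objects
    "pdf": None,
    "page": None,
}
result = call_with_requested_args(handler, available)
print(result)  # ('Tj', b'(Hello) Tj')
```

Because the lookup is by parameter name, the order of the handler's parameters does not matter.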
🏗 Architecture
pdfbeaver solves the problem of mapping input geometry to output streams incrementally, allowing state to be interrogated mid-stream.
graph LR
A[Input Stream] --> B[StreamStateIterator];
B --> C{State Tracker};
C --> D[Handler Registry];
D --> E[Stream Editor];
E --> F[Optimizer];
F --> G[Output Stream];
- StreamStateIterator: Wraps pdfminer to interpret the stream byte-by-byte, updating a virtual graphics state (Matrices, Fonts).
- HandlerRegistry: Intercepts specific operators defined by the user.
- StreamEditor: Recompiles the stream. It injects modified operators or passes original raw bytes for maximum speed and fidelity.
- Optimizer: Runs a post-processing pass to clean up redundant operators (e.g., 1 0 0 rg followed immediately by 0 1 0 rg).
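The Optimizer's example (one `rg` immediately overwritten by another) is a classic dead store. The following is a minimal sketch of such a peephole pass over a list of `(operands, operator)` tuples. It is not pdfbeaver's optimizer: the `SETTERS` set and the function name are assumptions chosen for illustration, and the pass only catches back-to-back repeats of the same operator.

```python
# Operators that only overwrite a single piece of graphics state.
# An illustrative subset, not an exhaustive list.
SETTERS = {"rg", "RG", "g", "G", "k", "K", "w"}


def drop_overwritten_setters(instructions):
    """Remove a setter that is immediately followed by the same setter."""
    result = []
    for operands, operator in instructions:
        if result and operator in SETTERS and result[-1][1] == operator:
            # The previous store was never used: replace it with the new one.
            result[-1] = (operands, operator)
        else:
            result.append((operands, operator))
    return result


stream = [
    ([1, 0, 0], "rg"),  # dead: overwritten before anything is drawn
    ([0, 1, 0], "rg"),
    (["Hello"], "Tj"),
    ([0, 0, 1], "rg"),
]
optimized = drop_overwritten_setters(stream)
print(optimized)
# [([0, 1, 0], 'rg'), (['Hello'], 'Tj'), ([0, 0, 1], 'rg')]
```

A fuller pass would also recognize different operators that write the same state (for example `g` followed by `rg`) and stores that are never read before the end of the stream.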
📚 Advanced Usage
The StreamContext
Every handler receives a context object containing:
- context.tracker: The active state tracker (access gstate, textstate, get_current_user_pos()).
- context.page: The pikepdf.Page object currently being processed.
- context.container: The specific object being processed (could be a Page or a Form XObject).
See docs/ for documentation. (Hopefully this will appear on readthedocs some day.)
📄 License
MIT License. See LICENSE for details.