
A library for manipulating PDF content streams.


pdfbeaver

A context-aware PDF content stream editor.


beaver: an animal which manipulates water streams.

pdfbeaver: a library which manipulates PDF content streams.

pdfbeaver bridges the gap between reading PDFs (calculating text positions, tracking graphics state) and writing PDFs (injecting operators, removing content). With pdfbeaver, you can write PDF content stream filters that know "where you are on the page" at any point in the content stream.

Example applications:

  • change colors of PDF text and vector graphics
  • redact PDF text content without disrupting the rest of the text
  • optimize vector paths in PDF graphics
  • replace fonts in a PDF file

It is built on top of pikepdf (and qpdf) for PDF writing/manipulation and pdfminer.six for stream parsing and state tracking.

🚀 Key Features

  • User-friendly API: register stream editing methods using decorators.
  • Context-Aware Editing: Modify operators based on the current graphics state (Font, Color, Matrix, CTM).
  • Safe Recursion: Automatically traverses and modifies Form XObjects, ensuring nested content is treated exactly like page content.
  • State Tracking: Tracks the cursor position ($x, y$) and transformation matrices ($Tm, CTM$) as you parse.
  • Peephole Optimization: Includes passes to remove dead stores (unused graphics state updates) to keep output files small.
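
To make the state-tracking feature concrete, here is a minimal sketch of the matrix arithmetic behind a page position, in plain Python. It is not pdfbeaver's implementation: the helper names mat_mul and apply_matrix and the sample matrices are assumptions for illustration. PDF matrices are written as six numbers [a b c d e f], and a text-space point is mapped by Tm first, then by the CTM.

```python
def mat_mul(m1, m0):
    """Compose two PDF matrices: apply m1 first, then m0."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def apply_matrix(m, point):
    """Transform an (x, y) point by a PDF matrix."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


# Hypothetical sample state: the text matrix moves the cursor to (72, 700),
# and the CTM scales by 2 and then shifts by (10, 20).
Tm = (1, 0, 0, 1, 72, 700)
CTM = (2, 0, 0, 2, 10, 20)

combined = mat_mul(Tm, CTM)
position = apply_matrix(combined, (0, 0))  # where the text origin lands
print(combined, position)
```

A state tracker has to repeat this composition every time an operator such as cm, Tm, or Td changes one of the matrices, which is why doing it by hand inside a filter quickly becomes error-prone.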

📦 Installation

pip install pdfbeaver

(Note: Requires pikepdf and pdfminer.six)

⚡ Quick Start

1. Simple Operator Replacement

Change all text to red.

import pikepdf
import pdfbeaver

pdf = pikepdf.open("input.pdf")

@pdfbeaver.register("Tj", "TJ", "'", '"')
def make_text_red(op, operands, raw_bytes):
    # Return a sequence of instructions:
    # 1. Set RGB color to Red (1, 0, 0)
    # 2. Draw the original text
    return [
        ([1, 0, 0], "rg"),  # Non-stroking red
        ([1, 0, 0], "RG"),  # Stroking red
        raw_bytes           # Original text op
    ]

pdfbeaver.process(pdf)
pdf.save("output_red.pdf")

2. Context-Aware Modification (Redaction)

Delete text only if it appears in the top-left quadrant of the page.

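@pdfbeaver.register("Tj", "TJ")
def delete_top_left(context):
    # Position of the text about to be drawn, before this operator runs.
    x, y = pdfbeaver.extract_text_position(context.pre_input)[:2]
    # PDF coordinates normally have their origin at the bottom-left,
    # so a small x and a large y mean "top-left".
    if x < 300 and y > 400:
        return None  # Drop the text operator
    return pdfbeaver.UNCHANGED  # Pass through unchanged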
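
The thresholds 300 and 400 in the handler are fixed numbers. As a hedged sketch, the same test could be derived from the page box instead. The helper in_top_left_quadrant is hypothetical (not part of pdfbeaver) and takes the box as (llx, lly, urx, ury), the order used by a PDF MediaBox.

```python
def in_top_left_quadrant(x, y, box):
    """Return True if (x, y) lies in the top-left quarter of the page box.

    Assumes PDF-style coordinates: origin at the bottom-left, y growing upward.
    """
    llx, lly, urx, ury = box
    mid_x = (llx + urx) / 2
    mid_y = (lly + ury) / 2
    return x < mid_x and y > mid_y


# US Letter page in PDF points (an illustrative value, not read from a file).
LETTER = (0, 0, 612, 792)

print(in_top_left_quadrant(72, 700, LETTER))   # near the top-left corner
print(in_top_left_quadrant(400, 700, LETTER))  # top, but right half
```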

Flexible Signatures

The @register decorator inspects your function's signature. A handler may declare any of the following parameters, in any order:

  • operands (or args): List of arguments for the operator.
  • operator (or op): The operator string (e.g. "Tj").
  • raw_bytes: The original binary data for this instruction.
  • context: The StreamContext object.
  • pdf: The pikepdf.Pdf document.
  • page: The pikepdf.Page object.
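
Signature-based dispatch of this kind can be built with the standard inspect module. The sketch below shows one possible approach and is not pdfbeaver's actual code: the names ALIASES and adapt are hypothetical, and only the parameter names come from the list above.

```python
import inspect

# Hypothetical alias table: accepted parameter name -> canonical value key.
ALIASES = {
    "operands": "operands", "args": "operands",
    "operator": "operator", "op": "operator",
    "raw_bytes": "raw_bytes",
    "context": "context",
    "pdf": "pdf",
    "page": "page",
}


def adapt(handler):
    """Wrap a handler so it receives only the parameters it declares."""
    params = list(inspect.signature(handler).parameters)
    unknown = [name for name in params if name not in ALIASES]
    if unknown:
        raise TypeError(f"unsupported handler parameters: {unknown}")

    def call(**available):
        # Look up each declared parameter under its canonical key.
        return handler(**{name: available[ALIASES[name]] for name in params})

    return call


def show(raw_bytes, op):  # any subset of names, in any order
    return (op, raw_bytes)


wrapped = adapt(show)
result = wrapped(operands=["Hi"], operator="Tj", raw_bytes=b"(Hi) Tj",
                 context=None, pdf=None, page=None)
print(result)
```

Inspecting the signature once at registration time, rather than on every operator, keeps the per-instruction overhead to a dictionary lookup per declared parameter.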

🏗 Architecture

pdfbeaver solves the problem of mapping input geometry to output streams incrementally, allowing state to be interrogated mid-stream.

graph LR
    A[Input Stream] --> B[StreamStateIterator];
    B --> C{State Tracker};
    C --> D[Handler Registry];
    D --> E[Stream Editor];
    E --> F[Optimizer];
    F --> G[Output Stream];

  1. StreamStateIterator: Wraps pdfminer to interpret the stream byte-by-byte, updating a virtual graphics state (Matrices, Fonts).
  2. HandlerRegistry: Intercepts specific operators defined by the user.
  3. StreamEditor: Recompiles the stream. It injects modified operators or passes original raw bytes for maximum speed and fidelity.
  4. Optimizer: Runs a post-processing pass to clean up redundant operators (e.g., 1 0 0 rg followed immediately by 0 1 0 rg).
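
The optimizer step can be illustrated with a small peephole pass. This is a simplified sketch under stated assumptions, not pdfbeaver's optimizer: instructions are modeled as (operands, operator) tuples, the function name drop_dead_color_stores is hypothetical, and only back-to-back non-stroking color operators (rg, g, k) are considered.

```python
# Operators that each fully replace the non-stroking (fill) color.
NONSTROKE_SETTERS = {"rg", "g", "k"}


def drop_dead_color_stores(instructions):
    """Remove a fill-color setter that is immediately overwritten by another."""
    out = []
    for operands, operator in instructions:
        if (out and operator in NONSTROKE_SETTERS
                and out[-1][1] in NONSTROKE_SETTERS):
            out.pop()  # the previous setter never affects any drawing
        out.append((operands, operator))
    return out


stream = [
    ([1, 0, 0], "rg"),   # dead store: overwritten before anything is drawn
    ([0, 1, 0], "rg"),
    (["Hello"], "Tj"),
]
optimized = drop_dead_color_stores(stream)
print(optimized)
```

A real pass would also need to look past unrelated operators and respect q/Q save and restore boundaries; this sketch only handles the adjacent case given as the example in the list.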

📚 Advanced Usage

The StreamContext

Every handler that declares a context parameter receives a StreamContext object containing:

  • context.tracker: The active state tracker (access gstate, textstate, get_current_user_pos()).
  • context.page: The pikepdf.Page object currently being processed.
  • context.container: The specific object being processed (could be a Page or a Form XObject).

See docs/ for full documentation. (Hopefully it will appear on Read the Docs some day.)

📄 License

MIT License. See LICENSE for details.
