PDF Sentinel

PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or resource-intensive pages (like blueprints) that can slow down or crash OCR, rendering, or document-processing pipelines.

It is designed as a pre-flight guard before expensive operations such as OCR, Vision-LLM inference, rasterization, or downstream document pipelines.


Features

  • Detects risky PDF pages:
    • Oversized page dimensions (A0, engineering drawings, blueprints)
    • Large embedded images
    • Vector-heavy pages (architectural / CAD drawings)
    • Pages with excessive rasterization cost
  • Page-level and file-level analysis
  • Two parallel safety models:
    • Default (configurable, conservative)
    • Advanced (tuned, risk-based)
  • JSON output for API integration

Installation

pip install pdfsentinel

Quick Start

from pdfsentinel import PDFSentinel

sentinel = PDFSentinel()

print(sentinel.is_file_safe("samples/test.pdf"))
print(sentinel.is_page_safe("samples/test.pdf", 1, json_response=True))

Outputs

PDF Sentinel returns Python dicts by default. If json_response=True, the same structure is returned as a JSON string.
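Since the return type depends on json_response, code that may receive either form can normalise it first. The following is a minimal sketch; as_dict is a hypothetical helper, not part of PDF Sentinel, and the sample string is a trimmed-down illustration of the documented structure.

```python
import json
from typing import Any, Dict, Union


def as_dict(result: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a result as a dict, whether it arrived as a dict or a JSON string."""
    if isinstance(result, str):
        return json.loads(result)
    return result


# Trimmed-down sample of a JSON-string page result (illustrative only).
sample_json = '{"file_name": "test.pdf", "page": 1, "is_page_safety": true, "errors": []}'
print(as_dict(sample_json)["is_page_safety"])
```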

All pages include both verdicts:

  • is_page_safety + errors (default model)
  • is_page_safety_advanced + errors_advanced (advanced model)

File-level analysis includes summary strings so you can quickly see which pages failed:

  • unsafe_pages (comma-separated page numbers for default model)
  • unsafe_pages_advanced (comma-separated page numbers for advanced model)

No error aggregation is done at the file root; reasons live on each page result.
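Because unsafe_pages and unsafe_pages_advanced are comma-separated strings, a caller that needs page numbers as integers has to split them. A small sketch, assuming the separator is a plain comma and an empty string means no unsafe pages (as in the examples on this page); parse_unsafe_pages is a hypothetical helper name.

```python
from typing import List


def parse_unsafe_pages(value: str) -> List[int]:
    """Turn a summary string such as "2,5,9" into [2, 5, 9]; "" becomes []."""
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_unsafe_pages("2"))
print(parse_unsafe_pages(""))
```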


Public API

1) file_analysis(file_path, config=None, json_response=False, timeout_seconds=None)

Runs a full scan of all pages and returns per-page results.

PDF Sentinel applies a 30-second document load timeout by default. You can override it per call with timeout_seconds. If the analysis exceeds the active timeout, the file is marked unsafe.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafe_pages": "2",
  "is_file_safety_advanced": true,
  "unsafe_pages_advanced": "",
  "results": [
    {
      "page": 1,
      "is_page_safety": true,
      "errors": [],
      "is_page_safety_advanced": true,
      "errors_advanced": [],
      "metrics": {
        "physical": {},
        "images": [],
        "vector": {},
        "text": {}
      },
      "summary": {
        "page_width_pt": 612.0,
        "page_height_pt": 792.0,
        "max_embedded_image_pixels": 0,
        "vector_path_count": 58,
        "raster_estimate_pixels_300dpi": 8415000
      }
    }
  ]
}
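Since reasons live on each page result rather than at the file root, a caller who wants a file-wide view can aggregate them. The sketch below assumes a report shaped like the example above; errors_by_page is a hypothetical helper, and the sample report is abbreviated (metrics and summary omitted).

```python
from typing import Any, Dict, List


def errors_by_page(report: Dict[str, Any], advanced: bool = False) -> Dict[int, List[str]]:
    """Map page number -> error list, keeping only pages that have errors."""
    key = "errors_advanced" if advanced else "errors"
    return {page["page"]: page[key] for page in report["results"] if page[key]}


# Abbreviated report in the documented shape (illustrative data).
report = {
    "file_name": "test.pdf",
    "pages": 2,
    "results": [
        {"page": 1, "errors": [], "errors_advanced": []},
        {"page": 2, "errors": ["raster_estimate_too_big:77760000"], "errors_advanced": []},
    ],
}
print(errors_by_page(report))
print(errors_by_page(report, advanced=True))
```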

2) page_analysis(file_path, page, config=None, json_response=False, timeout_seconds=None)

Runs a detailed scan of a single page (1-based index).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": [],
  "metrics": {
    "physical": {},
    "images": [],
    "vector": {},
    "text": {}
  },
  "summary": {
    "page_width_pt": 2592.0,
    "page_height_pt": 1728.0,
    "max_embedded_image_pixels": 354652,
    "vector_path_count": 33035,
    "raster_estimate_pixels_300dpi": 77760000
  }
}

If the page index is invalid, the method returns:

{
  "file_name": "test.pdf",
  "page": 999,
  "is_page_safety": false,
  "errors": ["invalid_page:999"],
  "is_page_safety_advanced": false,
  "errors_advanced": ["invalid_page:999"],
  "metrics": { "physical": {}, "images": [], "vector": {}, "text": {} },
  "summary": {
    "page_width_pt": 0.0,
    "page_height_pt": 0.0,
    "max_embedded_image_pixels": 0,
    "vector_path_count": 0,
    "raster_estimate_pixels_300dpi": 0
  }
}

page_analysis(...) also uses the default 30-second timeout unless you override it with timeout_seconds. If the analysis exceeds the active timeout, the page is returned as unsafe with a document_load_timeout:<seconds>s error.
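The error strings shown on this page follow a code:detail pattern (raster_estimate_too_big:77760000, invalid_page:999, document_load_timeout:<seconds>s). Assuming that pattern holds for all errors, which the documentation does not state explicitly, a caller can branch on the code. split_error and timed_out are hypothetical helper names.

```python
from typing import Any, Dict, Tuple


def split_error(error: str) -> Tuple[str, str]:
    """Split "code:detail" at the first colon; detail is "" if there is no colon."""
    code, _, detail = error.partition(":")
    return code, detail


def timed_out(page_result: Dict[str, Any]) -> bool:
    """True if any default-model error on the page is a document load timeout."""
    return any(split_error(e)[0] == "document_load_timeout" for e in page_result["errors"])


print(split_error("raster_estimate_too_big:77760000"))
print(timed_out({"errors": ["document_load_timeout:30s"]}))
```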


3) is_file_safe(file_path, config=None, json_response=False, timeout_seconds=None)

Convenience method that returns only the unsafe pages and their errors, for both the default and the advanced model. Useful for fast checks or CLI output.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafety_pages": [
    {
      "page": 2,
      "errors": [
        "raster_estimate_too_big:77760000"
      ]
    }
  ],
  "is_file_safety_advanced": true,
  "unsafety_pages_advanced": []
}
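For CLI-style output, the result above can be condensed into one line. This is only a sketch built on the documented keys of the default model; format_verdict is a hypothetical helper, and the wording of the line is arbitrary.

```python
from typing import Any, Dict


def format_verdict(result: Dict[str, Any]) -> str:
    """Render a one-line summary of an is_file_safe-style result (default model)."""
    name = result["file_name"]
    if result["is_file_safety"]:
        return f"{name}: safe ({result['pages']} pages)"
    details = "; ".join(
        f"page {entry['page']}: {', '.join(entry['errors'])}"
        for entry in result["unsafety_pages"]
    )
    return f"{name}: UNSAFE ({details})"


# Same data as the example result above.
result = {
    "file_name": "test.pdf",
    "pages": 2,
    "is_file_safety": False,
    "unsafety_pages": [{"page": 2, "errors": ["raster_estimate_too_big:77760000"]}],
    "is_file_safety_advanced": True,
    "unsafety_pages_advanced": [],
}
print(format_verdict(result))
```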

4) is_page_safe(file_path, page, config=None, json_response=False, timeout_seconds=None)

Convenience method for a single page. This currently returns the same structure as page_analysis(...).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": []
}

Configuration (Default Model Only)

You can override default safety thresholds per call. The advanced model is tuned and not intended to be overridden at runtime.

sentinel.is_file_safe(
    "samples/test.pdf",
    config={
        "max_page_size": 1800,
        "max_image_pixels": 10_000_000,
        "max_vectors_operations": 1000,
        "max_raster_pixels": 20_000_000,
        "document_load_timeout_seconds": 30
    }
)
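The documentation does not say how unknown or missing config keys are handled, so one cautious approach is to build a complete config dict yourself and reject misspelled keys before calling the library. A minimal sketch: build_config and DEFAULTS are hypothetical names, and the default values are copied from the parameter table in this section.

```python
from typing import Any, Dict

# Defaults as listed in this section's parameter table.
DEFAULTS: Dict[str, Any] = {
    "max_page_size": 2000,
    "max_image_pixels": 20_000_000,
    "max_vectors_operations": 1500,
    "max_raster_pixels": 30_000_000,
    "document_load_timeout_seconds": 30,
}


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Return a full config dict, raising on keys that are not documented."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = build_config(max_page_size=1800, max_raster_pixels=20_000_000)
print(config)
```

The resulting dict can then be passed as config=... to any of the four public methods.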

PDF Sentinel applies the default document load timeout automatically. To override it for a single call, pass timeout_seconds:

sentinel.is_file_safe("samples/test.pdf", timeout_seconds=30)
sentinel.page_analysis("samples/test.pdf", 1, timeout_seconds=30)

Parameter                       Default     Description
max_page_size                   2000        Max page dimension in points (pt)
max_image_pixels                20000000    Max pixels for a single embedded image
max_vectors_operations          1500        Max allowed vector drawing operations
max_raster_pixels               30000000    Estimated raster size (300 DPI)
document_load_timeout_seconds   30          Max seconds allowed to open/analyze a PDF
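To get a feel for max_raster_pixels, it helps to see how a 300 DPI raster estimate relates to page size. The formula below is inferred from the sample outputs on this page (it reproduces both raster_estimate_pixels_300dpi values), not taken from the library's source, so treat it as an assumption. raster_estimate_pixels is a hypothetical helper.

```python
POINTS_PER_INCH = 72  # 1 PDF point = 1/72 inch


def raster_estimate_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> int:
    """Estimate the pixel count of a page rasterized at the given DPI."""
    width_px = width_pt / POINTS_PER_INCH * dpi
    height_px = height_pt / POINTS_PER_INCH * dpi
    return round(width_px * height_px)


print(raster_estimate_pixels(612.0, 792.0))    # 8415000  (US Letter sample page)
print(raster_estimate_pixels(2592.0, 1728.0))  # 77760000 (oversized sample page)
```

With the default max_raster_pixels of 30000000, the second page exceeds the limit, which is consistent with the raster_estimate_too_big:77760000 error in the examples.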

Advanced Safety Model (How it Works)

PDF Sentinel includes an advanced safety model that runs in parallel with the default rules.

While the default model focuses on conservative limits (page size, vector count, raster estimates), the advanced model is risk-based and tuned using real-world performance data from PDF rendering pipelines.

The advanced model flags a page as unsafe if any of the following conditions are met:

  • Extreme physical size: Pages whose largest dimension exceeds a hard threshold are likely to cause excessive render time, regardless of content.
  • Very wide pages: Unusually wide pages (common in blueprints and engineering drawings) tend to stress rasterization and memory allocation.
  • Raster fan-out: Pages containing many embedded images with large combined pixel counts (and no soft masks) are strong indicators of memory pressure and CPU spikes during rendering.

Conceptually, the advanced decision is a simple OR gate over these risk signals:

dangerous = render_risk OR rss_risk

Where:

  • render_risk is driven primarily by physical page dimensions
  • rss_risk is driven by raster fan-out and total pixel pressure
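The OR gate can be illustrated with a toy version. Everything numeric here is a placeholder: the real thresholds are pre-tuned and not documented, the soft-mask condition is left out, and PageSignals, render_risk, rss_risk, and dangerous are hypothetical names that only mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    width_pt: float
    height_pt: float
    image_count: int
    total_image_pixels: int


# Placeholder thresholds for illustration only; NOT the library's tuned values.
MAX_DIMENSION_PT = 5000.0
MAX_WIDTH_PT = 3000.0
MAX_IMAGE_COUNT = 50
MAX_TOTAL_IMAGE_PIXELS = 100_000_000


def render_risk(s: PageSignals) -> bool:
    """Extreme physical size or a very wide page."""
    return max(s.width_pt, s.height_pt) > MAX_DIMENSION_PT or s.width_pt > MAX_WIDTH_PT


def rss_risk(s: PageSignals) -> bool:
    """Raster fan-out: many images AND a large combined pixel count."""
    return s.image_count > MAX_IMAGE_COUNT and s.total_image_pixels > MAX_TOTAL_IMAGE_PIXELS


def dangerous(s: PageSignals) -> bool:
    """OR gate over the two risk signals."""
    return render_risk(s) or rss_risk(s)


print(dangerous(PageSignals(612.0, 792.0, 0, 0)))
print(dangerous(PageSignals(6000.0, 800.0, 0, 0)))
```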

The advanced model is intentionally not configurable at runtime. Its thresholds are pre-tuned and designed to be stable, predictable, and comparable across environments.

This makes it ideal for:

  • Early rejection of pathological PDFs
  • Protecting OCR and AI pipelines from worst-case inputs
  • Fast, deterministic safety decisions at scale

You are free to rely on the default model, the advanced model, or both, depending on how strict your pipeline needs to be.
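One way to express that choice in a pipeline is a small gate over the two file-level verdicts. This is a sketch only: should_process and the policy names are hypothetical, and the input is assumed to be a dict with the is_file_safety and is_file_safety_advanced keys shown in the file-level examples.

```python
from typing import Any, Dict


def should_process(result: Dict[str, Any], policy: str = "both") -> bool:
    """Decide whether to continue, based on the chosen safety model(s)."""
    default_ok = result["is_file_safety"]
    advanced_ok = result["is_file_safety_advanced"]
    if policy == "default":
        return default_ok
    if policy == "advanced":
        return advanced_ok
    if policy == "both":  # strictest: both models must pass
        return default_ok and advanced_ok
    raise ValueError(f"unknown policy: {policy!r}")


# Verdicts taken from the file-level example: default fails, advanced passes.
verdicts = {"is_file_safety": False, "is_file_safety_advanced": True}
print(should_process(verdicts, "advanced"))
print(should_process(verdicts, "both"))
```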

License

MIT License © 2025 — Not Empty Foundation
