PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or otherwise resource-intensive pages (like blueprints) that could slow down or crash OCR and pipelines.

Reason this release was yanked:

timeout is not working

PDF Sentinel

PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or resource-intensive pages (like blueprints) that can slow down or crash OCR, rendering, or document-processing pipelines.

It is designed as a pre-flight guard before expensive operations such as OCR, Vision-LLM inference, rasterization, or downstream document pipelines.


Features

  • Detects risky PDF pages:
    • Oversized page dimensions (A0, engineering drawings, blueprints)
    • Large embedded images
    • Vector-heavy pages (architectural / CAD drawings)
    • Pages with excessive rasterization cost
  • Page-level and file-level analysis
  • Two parallel safety models:
    • Default (configurable, conservative)
    • Advanced (tuned, risk-based)
  • JSON output for API integration

Installation

pip install pdfsentinel

Quick Start

from pdfsentinel import PDFSentinel

sentinel = PDFSentinel()

print(sentinel.is_file_safe("samples/test.pdf"))
print(sentinel.is_page_safe("samples/test.pdf", 1, json_response=True))

Outputs

PDF Sentinel returns Python dicts by default. If json_response=True, the same structure is returned as a JSON string.

All pages include both verdicts:

  • is_page_safety + errors (default model)
  • is_page_safety_advanced + errors_advanced (advanced model)

File-level analysis includes summary strings so you can quickly see which pages failed:

  • unsafe_pages (comma-separated page numbers for default model)
  • unsafe_pages_advanced (comma-separated page numbers for advanced model)

No error aggregation is done at the file root; reasons live on each page result.
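
Because unsafe_pages and unsafe_pages_advanced are comma-separated strings, callers will usually want them as integers. The following is a minimal sketch, assuming the separator is a plain comma (optionally followed by whitespace) and that an empty string means no page failed, as in the sample output. parse_unsafe_pages is a hypothetical helper, not part of the library:

```python
def parse_unsafe_pages(value: str) -> list[int]:
    """Turn a summary string such as "2,5,9" into [2, 5, 9].

    Assumption: pages are separated by commas and an empty string
    means that no page was flagged.
    """
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_unsafe_pages("2"))  # [2]
print(parse_unsafe_pages(""))   # []
```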


Public API

1) file_analysis(file_path, config=None, json_response=False, timeout_seconds=None)

Runs a full scan of all pages and returns per-page results.

PDF Sentinel applies a 30-second document load timeout by default. You can override it per call with timeout_seconds. If the analysis exceeds the active timeout, the file is marked unsafe.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafe_pages": "2",
  "is_file_safety_advanced": true,
  "unsafe_pages_advanced": "",
  "results": [
    {
      "page": 1,
      "is_page_safety": true,
      "errors": [],
      "is_page_safety_advanced": true,
      "errors_advanced": [],
      "metrics": {
        "physical": {},
        "images": [],
        "vector": {},
        "text": {}
      },
      "summary": {
        "page_width_pt": 612.0,
        "page_height_pt": 792.0,
        "max_embedded_image_pixels": 0,
        "vector_path_count": 58,
        "raster_estimate_pixels_300dpi": 8415000
      }
    }
  ]
}
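
Since failure reasons live on each page result rather than at the file root, a caller that wants one overview has to collect them. A small sketch of that aggregation follows. It runs on a hand-written dict in the documented shape (not real library output), and errors_by_page is a hypothetical helper name:

```python
def errors_by_page(report: dict) -> dict[int, dict[str, list[str]]]:
    """Map each failing page number to its default and advanced error lists."""
    failing = {}
    for page in report.get("results", []):
        default_errors = page.get("errors", [])
        advanced_errors = page.get("errors_advanced", [])
        if default_errors or advanced_errors:
            failing[page["page"]] = {
                "default": default_errors,
                "advanced": advanced_errors,
            }
    return failing


# Hand-written sample in the documented shape, trimmed to the fields used here.
sample_report = {
    "file_name": "test.pdf",
    "pages": 2,
    "results": [
        {"page": 1, "errors": [], "errors_advanced": []},
        {
            "page": 2,
            "errors": ["raster_estimate_too_big:77760000"],
            "errors_advanced": [],
        },
    ],
}

print(errors_by_page(sample_report))
```

In practice the dict would come from sentinel.file_analysis(...) with json_response left at its default of False.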

2) page_analysis(file_path, page, config=None, json_response=False, timeout_seconds=None)

Runs a detailed scan of a single page (1-based index).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": [],
  "metrics": {
    "physical": {},
    "images": [],
    "vector": {},
    "text": {}
  },
  "summary": {
    "page_width_pt": 2592.0,
    "page_height_pt": 1728.0,
    "max_embedded_image_pixels": 354652,
    "vector_path_count": 33035,
    "raster_estimate_pixels_300dpi": 77760000
  }
}
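
The raster_estimate_pixels_300dpi values in the samples are consistent with converting the page size from points to pixels at 300 DPI (72 points per inch) and multiplying width by height. The sketch below reproduces the documented numbers. The formula is inferred from those numbers and is an assumption about how the library computes the estimate; raster_pixels is a hypothetical helper:

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def raster_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> int:
    """Estimate the pixel count of a page rasterized at the given DPI."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px * height_px


print(raster_pixels(612.0, 792.0))    # 8415000  (US Letter, page 1 sample)
print(raster_pixels(2592.0, 1728.0))  # 77760000 (36 in x 24 in, page 2 sample)
```

The second value exceeds the default max_raster_pixels of 30000000, which matches the raster_estimate_too_big:77760000 error in the sample above.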

If the page index is invalid, the method returns:

{
  "file_name": "test.pdf",
  "page": 999,
  "is_page_safety": false,
  "errors": ["invalid_page:999"],
  "is_page_safety_advanced": false,
  "errors_advanced": ["invalid_page:999"],
  "metrics": { "physical": {}, "images": [], "vector": {}, "text": {} },
  "summary": {
    "page_width_pt": 0.0,
    "page_height_pt": 0.0,
    "max_embedded_image_pixels": 0,
    "vector_path_count": 0,
    "raster_estimate_pixels_300dpi": 0
  }
}

page_analysis(...) also uses the default 30-second timeout unless you override it with timeout_seconds. If the analysis exceeds the active timeout, the page is returned as unsafe with a document_load_timeout:<seconds>s error.
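
A pipeline may want to treat a timeout differently from a content-based rejection, for example by retrying with a larger timeout_seconds. The sketch below detects the timeout error in an errors list. It assumes the error string has the form document_load_timeout:30s (or a decimal such as 30.0s); timeout_seconds_from_errors is a hypothetical helper:

```python
from typing import Optional

TIMEOUT_PREFIX = "document_load_timeout:"


def timeout_seconds_from_errors(errors: list[str]) -> Optional[float]:
    """Return the timeout reported in an error list, or None if there is none.

    Assumption: the value after the prefix is a number followed by "s".
    """
    for error in errors:
        if error.startswith(TIMEOUT_PREFIX):
            return float(error[len(TIMEOUT_PREFIX):].rstrip("s"))
    return None


print(timeout_seconds_from_errors(["document_load_timeout:30s"]))         # 30.0
print(timeout_seconds_from_errors(["raster_estimate_too_big:77760000"]))  # None
```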


3) is_file_safe(file_path, config=None, json_response=False, timeout_seconds=None)

Convenience method that returns only the unsafe pages and their errors, for both the default and the advanced model. Useful for fast checks or CLI output.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafety_pages": [
    {
      "page": 2,
      "errors": [
        "raster_estimate_too_big:77760000"
      ]
    }
  ],
  "is_file_safety_advanced": true,
  "unsafety_pages_advanced": []
}
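
One use of this compact result is to send only the safe pages to an expensive step such as OCR. The sketch below derives that page list from a dict in the documented shape. It assumes that entries in unsafety_pages_advanced carry a "page" key like those in unsafety_pages (the sample only shows an empty list); pages_to_process is a hypothetical helper:

```python
def pages_to_process(report: dict, advanced: bool = False) -> list[int]:
    """Return the 1-based page numbers that were not flagged as unsafe."""
    key = "unsafety_pages_advanced" if advanced else "unsafety_pages"
    unsafe = {entry["page"] for entry in report.get(key, [])}
    return [n for n in range(1, report["pages"] + 1) if n not in unsafe]


# Hand-written sample mirroring the documented is_file_safe output.
sample = {
    "file_name": "test.pdf",
    "pages": 2,
    "is_file_safety": False,
    "unsafety_pages": [
        {"page": 2, "errors": ["raster_estimate_too_big:77760000"]}
    ],
    "is_file_safety_advanced": True,
    "unsafety_pages_advanced": [],
}

print(pages_to_process(sample))                 # [1]
print(pages_to_process(sample, advanced=True))  # [1, 2]
```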

4) is_page_safe(file_path, page, config=None, json_response=False, timeout_seconds=None)

Convenience method for a single page. This currently returns the same structure as page_analysis(...).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": []
}

Configuration (Default Model Only)

You can override default safety thresholds per call. The advanced model is tuned and not intended to be overridden at runtime.

sentinel.is_file_safe(
    "samples/test.pdf",
    config={
        "max_page_size": 1800,
        "max_image_pixels": 10_000_000,
        "max_vectors_operations": 1000,
        "max_raster_pixels": 20_000_000,
        "document_load_timeout_seconds": 30
    }
)

PDF Sentinel now applies a default document load timeout automatically. You can also override it per call when needed:

sentinel.is_file_safe("samples/test.pdf", timeout_seconds=30)
sentinel.page_analysis("samples/test.pdf", 1, timeout_seconds=30)

| Parameter | Default | Description |
| --- | --- | --- |
| max_page_size | 2000 | Max page dimension in points (pt) |
| max_image_pixels | 20000000 | Max pixels for a single embedded image |
| max_vectors_operations | 1500 | Max allowed vector drawing operations |
| max_raster_pixels | 30000000 | Estimated raster size (300 DPI) |
| document_load_timeout_seconds | 30 | Max seconds allowed to open/analyze a PDF |
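
When only one or two thresholds need to change, it can help to build the config from the documented defaults and reject misspelled keys early. The sketch below does that. Whether the library itself accepts partial dicts or validates keys is not stated here, so the helper always produces a complete dict; build_config and DEFAULT_CONFIG are hypothetical names that copy the table values:

```python
# Defaults copied from the parameter table; not imported from the library.
DEFAULT_CONFIG = {
    "max_page_size": 2000,
    "max_image_pixels": 20_000_000,
    "max_vectors_operations": 1500,
    "max_raster_pixels": 30_000_000,
    "document_load_timeout_seconds": 30,
}


def build_config(**overrides: int) -> dict[str, int]:
    """Return a full config dict with the given thresholds overridden."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


config = build_config(max_page_size=1800, max_raster_pixels=20_000_000)
print(config["max_page_size"])           # 1800
print(config["max_vectors_operations"])  # 1500
```

The resulting dict could then be passed as config=config to any of the four public methods.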

Advanced Safety Model (How it Works)

PDF Sentinel includes an advanced safety model that runs in parallel with the default rules.

While the default model focuses on conservative limits (page size, vector count, raster estimates), the advanced model is risk-based and tuned using real-world performance data from PDF rendering pipelines.

The advanced model flags a page as unsafe if any of the following conditions are met:

  • Extreme physical size: pages whose largest dimension exceeds a hard threshold are likely to cause excessive render time, regardless of content.
  • Very wide pages: unusually wide pages (common in blueprints and engineering drawings) tend to stress rasterization and memory allocation.
  • Raster fan-out: pages containing many embedded images with large combined pixel counts (and no soft masks) are strong indicators of memory pressure and CPU spikes during rendering.

Conceptually, the advanced decision is a simple OR gate over these risk signals:

dangerous = render_risk OR rss_risk

Where:

  • render_risk is driven primarily by physical page dimensions
  • rss_risk is driven by raster fan-out and total pixel pressure
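
The OR gate can be illustrated with a short sketch. The thresholds below are invented for the example and are not the library's tuned values, which are not documented here. PageSignals, render_risk, rss_risk, and is_dangerous are hypothetical names that only mirror the description above:

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
MAX_DIMENSION_PT = 3000.0
MAX_WIDTH_PT = 2500.0
MAX_IMAGE_COUNT = 50
MAX_TOTAL_IMAGE_PIXELS = 100_000_000


@dataclass
class PageSignals:
    width_pt: float
    height_pt: float
    image_count: int
    total_image_pixels: int
    has_soft_masks: bool


def render_risk(page: PageSignals) -> bool:
    """Extreme physical size or a very wide page."""
    too_large = max(page.width_pt, page.height_pt) > MAX_DIMENSION_PT
    too_wide = page.width_pt > MAX_WIDTH_PT
    return too_large or too_wide


def rss_risk(page: PageSignals) -> bool:
    """Raster fan-out: many images, large combined pixels, no soft masks."""
    return (
        not page.has_soft_masks
        and page.image_count > MAX_IMAGE_COUNT
        and page.total_image_pixels > MAX_TOTAL_IMAGE_PIXELS
    )


def is_dangerous(page: PageSignals) -> bool:
    """dangerous = render_risk OR rss_risk"""
    return render_risk(page) or rss_risk(page)


letter = PageSignals(612.0, 792.0, 0, 0, False)
blueprint = PageSignals(3370.0, 2384.0, 2, 500_000, False)
print(is_dangerous(letter))     # False
print(is_dangerous(blueprint))  # True
```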

The advanced model is intentionally not configurable at runtime. Its thresholds are pre-tuned and designed to be stable, predictable, and comparable across environments.

This makes it ideal for:

  • Early rejection of pathological PDFs
  • Protecting OCR and AI pipelines from worst-case inputs
  • Fast, deterministic safety decisions at scale

You can rely on the default model, the advanced model, or both, depending on how strict your pipeline needs to be.
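
Since every page result carries both verdicts, the choice between the models can be expressed as a small policy function on the caller's side. The sketch below is one way to do it; is_acceptable and the policy names are hypothetical, and the input is a hand-written dict using the documented is_page_safety and is_page_safety_advanced keys:

```python
def is_acceptable(result: dict, policy: str = "both") -> bool:
    """Decide whether a page may proceed, given a page-level result.

    policy "default":  trust only the default model
    policy "advanced": trust only the advanced model
    policy "both":     strictest, both models must pass
    """
    default_ok = result["is_page_safety"]
    advanced_ok = result["is_page_safety_advanced"]
    if policy == "default":
        return default_ok
    if policy == "advanced":
        return advanced_ok
    if policy == "both":
        return default_ok and advanced_ok
    raise ValueError(f"unknown policy: {policy!r}")


# Verdicts taken from the page 2 sample shown earlier.
page_two = {"is_page_safety": False, "is_page_safety_advanced": True}
print(is_acceptable(page_two, "both"))      # False
print(is_acceptable(page_two, "advanced"))  # True
```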

License

MIT License © 2025 — Not Empty Foundation

