PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or otherwise resource-intensive pages (like blueprints) that could slow down or crash OCR and other document-processing pipelines.

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PDF Sentinel

PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or resource-intensive pages (like blueprints) that can slow down or crash OCR, rendering, or document-processing pipelines.

It is designed as a pre-flight guard before expensive operations such as OCR, Vision-LLM inference, rasterization, or downstream document pipelines.

Features

- Detects risky PDF pages:
  - Oversized page dimensions (A0, engineering drawings, blueprints)
  - Large embedded images
  - Vector-heavy pages (architectural / CAD drawings)
  - Pages with excessive rasterization cost
- Page-level and file-level analysis
- Two parallel safety models:
  - Default (configurable, conservative)
  - Advanced (tuned, risk-based)
- JSON output for API integration

Installation

pip install pdfsentinel

Quick Start

from pdfsentinel import PDFSentinel

sentinel = PDFSentinel()

print(sentinel.is_file_safe("samples/test.pdf"))
print(sentinel.is_page_safe("samples/test.pdf", 1, json_response=True))

Outputs

PDF Sentinel returns Python dicts by default. If json_response=True, the same structure is returned as a JSON string.
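
Because the return type depends on json_response, code that receives results from several callers may want to normalize them first. The following is a minimal sketch that assumes only what is stated above (a dict by default, a JSON string with the same structure otherwise). `as_dict` is a hypothetical helper and not part of PDF Sentinel.

```python
import json
from typing import Any, Dict, Union


def as_dict(result: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a result as a dict, whether it arrived as a dict or as a JSON string.

    Hypothetical helper for illustration; not part of the pdfsentinel package.
    """
    if isinstance(result, str):
        return json.loads(result)
    return result


# Stand-in for a page result that was requested with json_response=True.
json_result = (
    '{"file_name": "test.pdf", "page": 2, "is_page_safety": false, '
    '"errors": ["raster_estimate_too_big:77760000"]}'
)
page = as_dict(json_result)
print(page["is_page_safety"], page["errors"])
```

With this in place, downstream code can index into the result the same way regardless of how it was requested.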

All pages include both verdicts:

- is_page_safety and errors (default model)
- is_page_safety_advanced and errors_advanced (advanced model)

File-level analysis includes summary strings so you can quickly see which pages failed:

- unsafe_pages (comma-separated page numbers for default model)
- unsafe_pages_advanced (comma-separated page numbers for advanced model)

No error aggregation is done at the file root; reasons live on each page result.
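
Since the summary fields are strings and the reasons live on each page, a caller typically needs two small steps: turn the summary string into page numbers, and gather the per-page errors. The sketch below shows one way to do both. The helper names are hypothetical, and the exact separator (with or without a space after the comma) is an assumption, so the parser tolerates both.

```python
from typing import Any, Dict, List


def parse_page_list(value: str) -> List[int]:
    """Turn a comma-separated summary string such as "2,5" into [2, 5]; "" gives []."""
    return [int(part) for part in value.split(",") if part.strip()]


def errors_by_page(file_result: Dict[str, Any], advanced: bool = False) -> Dict[int, List[str]]:
    """Collect the per-page reasons for every page the chosen model marked unsafe."""
    verdict_key = "is_page_safety_advanced" if advanced else "is_page_safety"
    errors_key = "errors_advanced" if advanced else "errors"
    return {
        page["page"]: list(page[errors_key])
        for page in file_result["results"]
        if not page[verdict_key]
    }


# Trimmed stand-in for a file-level result; only the fields used here are included.
file_result = {
    "unsafe_pages": "2",
    "unsafe_pages_advanced": "",
    "results": [
        {"page": 1, "is_page_safety": True, "errors": [],
         "is_page_safety_advanced": True, "errors_advanced": []},
        {"page": 2, "is_page_safety": False,
         "errors": ["raster_estimate_too_big:77760000"],
         "is_page_safety_advanced": True, "errors_advanced": []},
    ],
}

print(parse_page_list(file_result["unsafe_pages"]))
print(errors_by_page(file_result))
```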

Public API

1) `file_analysis(file_path, config=None, json_response=False, timeout_seconds=None)`

Runs a full scan of all pages and returns per-page results.

PDF Sentinel applies a 30-second document load timeout by default. You can override it per call with timeout_seconds. If the analysis exceeds the active timeout, the file is marked unsafe.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafe_pages": "2",
  "is_file_safety_advanced": true,
  "unsafe_pages_advanced": "",
  "results": [
    {
      "page": 1,
      "is_page_safety": true,
      "errors": [],
      "is_page_safety_advanced": true,
      "errors_advanced": [],
      "metrics": {
        "physical": {},
        "images": [],
        "vector": {},
        "text": {}
      },
      "summary": {
        "page_width_pt": 612.0,
        "page_height_pt": 792.0,
        "max_embedded_image_pixels": 0,
        "vector_path_count": 58,
        "raster_estimate_pixels_300dpi": 8415000
      }
    }
  ]
}

2) `page_analysis(file_path, page, config=None, json_response=False, timeout_seconds=None)`

Runs a detailed scan of a single page (1-based index).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": [],
  "metrics": {
    "physical": {},
    "images": [],
    "vector": {},
    "text": {}
  },
  "summary": {
    "page_width_pt": 2592.0,
    "page_height_pt": 1728.0,
    "max_embedded_image_pixels": 354652,
    "vector_path_count": 33035,
    "raster_estimate_pixels_300dpi": 77760000
  }
}
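
The raster_estimate_pixels_300dpi values in the two example results above are consistent with converting each page dimension from points to pixels at 300 DPI (1 pt = 1/72 inch) and multiplying the two. The sketch below reproduces both numbers. That the library computes its estimate exactly this way, including the rounding, is an assumption, and `raster_estimate_pixels` is a hypothetical name.

```python
POINTS_PER_INCH = 72


def raster_estimate_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> int:
    """Estimate the pixel count of a page rasterized at the given DPI."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px * height_px


# 612 x 792 pt is 8.5 x 11 in, which is 2550 x 3300 px at 300 DPI.
letter = raster_estimate_pixels(612.0, 792.0)
# 2592 x 1728 pt is 36 x 24 in, which is 10800 x 7200 px at 300 DPI.
drawing = raster_estimate_pixels(2592.0, 1728.0)

print(letter)   # 8415000
print(drawing)  # 77760000
```

The second value is the number carried in the raster_estimate_too_big error for page 2.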

If the page index is invalid, the method returns:

{
  "file_name": "test.pdf",
  "page": 999,
  "is_page_safety": false,
  "errors": ["invalid_page:999"],
  "is_page_safety_advanced": false,
  "errors_advanced": ["invalid_page:999"],
  "metrics": { "physical": {}, "images": [], "vector": {}, "text": {} },
  "summary": {
    "page_width_pt": 0.0,
    "page_height_pt": 0.0,
    "max_embedded_image_pixels": 0,
    "vector_path_count": 0,
    "raster_estimate_pixels_300dpi": 0
  }
}

page_analysis(...) also uses the default 30-second timeout unless you override it with timeout_seconds. If the analysis exceeds the active timeout, the page is returned as unsafe with a document_load_timeout:<seconds>s error.
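
The error strings shown so far follow a code:detail pattern (raster_estimate_too_big:77760000, invalid_page:999, document_load_timeout:<seconds>s). If you want to branch on the reason, for example to retry timeouts but reject invalid pages, splitting at the first colon is enough. This is a sketch with hypothetical helper names. The concrete rendering "30s" used below is an assumption based on the pattern above.

```python
from typing import List, Tuple


def split_error(error: str) -> Tuple[str, str]:
    """Split "code:detail" at the first colon; the detail is "" when no colon is present."""
    code, _, detail = error.partition(":")
    return code, detail


def timed_out(errors: List[str]) -> bool:
    """True if any error in the list is a document load timeout."""
    return any(split_error(e)[0] == "document_load_timeout" for e in errors)


examples = [
    "raster_estimate_too_big:77760000",
    "invalid_page:999",
    "document_load_timeout:30s",  # assumed rendering of <seconds>s
]
for error in examples:
    print(split_error(error))
print(timed_out(examples))
```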

3) `is_file_safe(file_path, config=None, json_response=False, timeout_seconds=None)`

Convenience method that returns only the unsafe pages (default + advanced). Useful for fast checks or CLI output.

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "pages": 2,
  "is_file_safety": false,
  "unsafety_pages": [
    {
      "page": 2,
      "errors": [
        "raster_estimate_too_big:77760000"
      ]
    }
  ],
  "is_file_safety_advanced": true,
  "unsafety_pages_advanced": []
}
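
A common use of this result is routing: send safe pages on to OCR and hold back the rest. The sketch below derives both lists from the structure shown above. `split_pages` is a hypothetical helper, and it assumes pages are numbered 1 through pages, in line with the 1-based index used by page_analysis.

```python
from typing import Any, Dict, List, Tuple


def split_pages(file_safety: Dict[str, Any], advanced: bool = False) -> Tuple[List[int], List[int]]:
    """Return (safe_pages, unsafe_pages) for the chosen model."""
    key = "unsafety_pages_advanced" if advanced else "unsafety_pages"
    unsafe = sorted(entry["page"] for entry in file_safety[key])
    unsafe_set = set(unsafe)
    safe = [p for p in range(1, file_safety["pages"] + 1) if p not in unsafe_set]
    return safe, unsafe


# The example result from above.
file_safety = {
    "file_name": "test.pdf",
    "pages": 2,
    "is_file_safety": False,
    "unsafety_pages": [
        {"page": 2, "errors": ["raster_estimate_too_big:77760000"]},
    ],
    "is_file_safety_advanced": True,
    "unsafety_pages_advanced": [],
}

print(split_pages(file_safety))
print(split_pages(file_safety, advanced=True))
```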

4) `is_page_safe(file_path, page, config=None, json_response=False, timeout_seconds=None)`

Convenience method for a single page. This currently returns the same structure as page_analysis(...).

Returns (dict / JSON):

{
  "file_name": "test.pdf",
  "page": 2,
  "is_page_safety": false,
  "errors": [
    "raster_estimate_too_big:77760000"
  ],
  "is_page_safety_advanced": true,
  "errors_advanced": []
}

Configuration (Default Model Only)

You can override default safety thresholds per call. The advanced model is tuned and not intended to be overridden at runtime.

sentinel.is_file_safe(
    "samples/test.pdf",
    config={
        "max_page_size": 1800,
        "max_image_pixels": 10_000_000,
        "max_vectors_operations": 1000,
        "max_raster_pixels": 20_000_000,
        "document_load_timeout_seconds": 30
    }
)

PDF Sentinel now applies a default document load timeout automatically. You can also override it per call when needed:

sentinel.is_file_safe("samples/test.pdf", timeout_seconds=30)
sentinel.page_analysis("samples/test.pdf", 1, timeout_seconds=30)

Parameter	Default	Description
max_page_size	2000	Max page dimension in points (pt)
max_image_pixels	20000000	Max pixels for a single embedded image
max_vectors_operations	1500	Max allowed vector drawing operations
max_raster_pixels	30000000	Max estimated raster size in pixels at 300 DPI
document_load_timeout_seconds	30	Max seconds allowed to open/analyze a PDF
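
A misspelled key such as max_vector_operations (missing the "s") is easy to introduce, and this page does not say whether unknown keys are rejected or silently ignored. One defensive option is to build the config from the documented defaults and validate overrides yourself. The sketch below does that. `build_config` and `DEFAULT_THRESHOLDS` are hypothetical names, with the values copied from the table above.

```python
from typing import Any, Dict

# Defaults copied from the parameter table above.
DEFAULT_THRESHOLDS: Dict[str, Any] = {
    "max_page_size": 2000,
    "max_image_pixels": 20_000_000,
    "max_vectors_operations": 1500,
    "max_raster_pixels": 30_000_000,
    "document_load_timeout_seconds": 30,
}


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides onto the documented defaults, rejecting unknown keys and bad values."""
    unknown = sorted(set(overrides) - set(DEFAULT_THRESHOLDS))
    if unknown:
        raise KeyError(f"unknown config key(s): {', '.join(unknown)}")
    for name, value in overrides.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"{name} must be a positive number, got {value!r}")
    return {**DEFAULT_THRESHOLDS, **overrides}


strict = build_config(max_page_size=1800, max_raster_pixels=20_000_000)
print(strict)
```

The resulting dict can then be passed as config= to any of the four methods. Passing a complete dict also avoids depending on how partial configs are merged, which is not described here.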

Advanced Safety Model (How it Works)

PDF Sentinel includes an advanced safety model that runs in parallel with the default rules.

While the default model focuses on conservative limits (page size, vector count, raster estimates), the advanced model is risk-based and tuned using real-world performance data from PDF rendering pipelines.

The advanced model flags a page as unsafe if any of the following conditions are met:

- Extreme physical size: Pages whose largest dimension exceeds a hard threshold are likely to cause excessive render time, regardless of content.
- Very wide pages: Unusually wide pages (common in blueprints and engineering drawings) tend to stress rasterization and memory allocation.
- Raster fan-out: Pages containing many embedded images with large combined pixel counts (and no soft masks) are strong indicators of memory pressure and CPU spikes during rendering.

Conceptually, the advanced decision is a simple OR gate over these risk signals:

dangerous = render_risk OR rss_risk

Where:

- render_risk is driven primarily by physical page dimensions
- rss_risk is driven by raster fan-out and total pixel pressure
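
To make the shape of that decision concrete, here is a small sketch of an OR gate over the three signals described above. It is not the library's implementation: every threshold below is a placeholder invented for illustration (the tuned values are not published on this page), and the names `PageSignals`, `render_risk`, `rss_risk`, and `dangerous` as Python functions are hypothetical.

```python
from dataclasses import dataclass

# PLACEHOLDER thresholds, invented for this sketch. They are NOT the tuned values.
EXTREME_DIMENSION_PT = 5000.0
VERY_WIDE_PT = 3000.0
FANOUT_IMAGE_COUNT = 50
FANOUT_TOTAL_PIXELS = 100_000_000


@dataclass
class PageSignals:
    width_pt: float
    height_pt: float
    image_count: int
    total_image_pixels: int
    has_soft_masks: bool


def render_risk(page: PageSignals) -> bool:
    """Extreme physical size, or an unusually wide page."""
    extreme = max(page.width_pt, page.height_pt) > EXTREME_DIMENSION_PT
    very_wide = page.width_pt > VERY_WIDE_PT
    return extreme or very_wide


def rss_risk(page: PageSignals) -> bool:
    """Raster fan-out: many images, large combined pixel count, and no soft masks."""
    return (
        page.image_count > FANOUT_IMAGE_COUNT
        and page.total_image_pixels > FANOUT_TOTAL_PIXELS
        and not page.has_soft_masks
    )


def dangerous(page: PageSignals) -> bool:
    """The OR gate: a page is unsafe if either risk signal fires."""
    return render_risk(page) or rss_risk(page)


letter = PageSignals(612.0, 792.0, 0, 0, False)
banner = PageSignals(6000.0, 1000.0, 0, 0, False)
scan_heavy = PageSignals(612.0, 792.0, 80, 150_000_000, False)

print(dangerous(letter), dangerous(banner), dangerous(scan_heavy))
```

Because the gate is a plain OR, one firing signal is enough to reject a page, and adding a signal can only make the model stricter.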

The advanced model is intentionally not configurable at runtime. Its thresholds are pre-tuned and designed to be stable, predictable, and comparable across environments.

This makes it ideal for:

- Early rejection of pathological PDFs
- Protecting OCR and AI pipelines from worst-case inputs
- Fast, deterministic safety decisions at scale

You are free to rely on the default model, the advanced model, or both, depending on how strict your pipeline needs to be.
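
Since every page result carries both verdicts, the choice of model can be a single policy switch in your pipeline. A minimal sketch follows. `is_allowed` and the policy names are hypothetical, and "both" is taken here to mean that a page must pass both models.

```python
from typing import Any, Dict


def is_allowed(page_result: Dict[str, Any], policy: str = "both") -> bool:
    """Decide whether a page may proceed, based on the chosen safety model(s)."""
    default_ok = bool(page_result["is_page_safety"])
    advanced_ok = bool(page_result["is_page_safety_advanced"])
    if policy == "default":
        return default_ok
    if policy == "advanced":
        return advanced_ok
    if policy == "both":
        return default_ok and advanced_ok
    raise ValueError(f"unknown policy: {policy!r}")


# Verdicts from the page 2 example: unsafe under the default model, safe under the advanced one.
page_2 = {"page": 2, "is_page_safety": False, "is_page_safety_advanced": True}

for policy in ("default", "advanced", "both"):
    print(policy, is_allowed(page_2, policy))
```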

License

GNU General Public License v3 (GPLv3)

Release history

Version	Release date	Notes
3.0.1	May 13, 2026	This version
3.0.0	May 13, 2026	Yanked. Reason: timeout is not working
2.0.0	Jan 11, 2026
1.2.0	Jan 7, 2026
1.1.0	Nov 13, 2025
1.0.0	Nov 7, 2025

Download files

Source Distribution

pdfsentinel-3.0.1.tar.gz (21.7 kB)

Uploaded May 13, 2026 Source

Built Distribution

pdfsentinel-3.0.1-py3-none-any.whl (21.5 kB)

Uploaded May 13, 2026 Python 3

File details

Details for the file pdfsentinel-3.0.1.tar.gz.

File metadata

Download URL: pdfsentinel-3.0.1.tar.gz
Upload date: May 13, 2026
Size: 21.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdfsentinel-3.0.1.tar.gz
Algorithm	Hash digest
SHA256	`7cc7b69813f945a89018ab8569f93a7feb56a72c3b36e3c283f8f51d173d1539`
MD5	`985dac740a075a530c59b81e7a05b90e`
BLAKE2b-256	`0f11118047b0b3d90dfe4f100b7753cc812be7d5da3ef5a4dda77f9b042cfe4f`

File details

Details for the file pdfsentinel-3.0.1-py3-none-any.whl.

File metadata

Download URL: pdfsentinel-3.0.1-py3-none-any.whl
Upload date: May 13, 2026
Size: 21.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdfsentinel-3.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ceb6c5559dda1af584bb3afba1288695eb299af055783683a2d451f0a082f0e4`
MD5	`6b785a704258b82b5a061f0e45abe9dd`
BLAKE2b-256	`91075d2189e23bb29704db7c126579d1213fbc4032567b7ceb360ca29a507453`
