Reason this release was yanked:
timeout is not working
PDF Sentinel
PDF Sentinel is a lightweight safety inspection library for PDF documents. It detects oversized, vector-heavy, or resource-intensive pages (like blueprints) that can slow down or crash OCR, rendering, or document-processing pipelines.
It is designed as a pre-flight guard before expensive operations such as OCR, Vision-LLM inference, rasterization, or downstream document pipelines.
Features
- Detects risky PDF pages:
  - Oversized page dimensions (A0, engineering drawings, blueprints)
  - Large embedded images
  - Vector-heavy pages (architectural / CAD drawings)
  - Pages with excessive rasterization cost
- Page-level and file-level analysis
- Two parallel safety models:
  - Default (configurable, conservative)
  - Advanced (tuned, risk-based)
- JSON output for API integration
Installation
pip install pdfsentinel
Quick Start
from pdfsentinel import PDFSentinel
sentinel = PDFSentinel()
print(sentinel.is_file_safe("samples/test.pdf"))
print(sentinel.is_page_safe("samples/test.pdf", 1, json_response=True))
Outputs
PDF Sentinel returns Python dicts by default.
If json_response=True, the same structure is returned as a JSON string.
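Code that has to accept both return modes can normalize the value before inspecting it. The sketch below is caller-side only; `as_dict` is a hypothetical helper name and not part of pdfsentinel.

```python
import json
from typing import Any, Dict, Union


def as_dict(result: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a dict whether the call used json_response=True or not."""
    if isinstance(result, str):
        # json_response=True: the payload arrives as a JSON string.
        return json.loads(result)
    # Default mode: the payload is already a Python dict.
    return result


# Sample payload shaped like the documented output (not produced by a real scan).
print(as_dict('{"file_name": "test.pdf", "is_file_safety": false}'))
```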
All pages include both verdicts:
- is_page_safety + errors (default model)
- is_page_safety_advanced + errors_advanced (advanced model)
File-level analysis includes summary strings so you can quickly see which pages failed:
- unsafe_pages (comma-separated page numbers for default model)
- unsafe_pages_advanced (comma-separated page numbers for advanced model)
No error aggregation is done at the file root; reasons live on each page result.
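Because the summary fields are strings, a caller that wants integers has to split them. A minimal sketch, assuming the separator is a plain comma (the samples in this README only show `"2"` and `""`); `parse_unsafe_pages` is a hypothetical helper, not a library function.

```python
from typing import List


def parse_unsafe_pages(value: str) -> List[int]:
    """Turn a summary string such as "2,5" into [2, 5]; "" becomes []."""
    # Skip empty fragments so an empty summary yields an empty list.
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_unsafe_pages("2"))  # [2]
print(parse_unsafe_pages(""))   # []
```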
Public API
1) file_analysis(file_path, config=None, json_response=False, timeout_seconds=None)
Runs a full scan of all pages and returns per-page results.
PDF Sentinel applies a 30 second document load timeout by default. You can
override it per call with timeout_seconds. If the analysis exceeds the active
timeout, the file is marked unsafe.
Returns (dict / JSON):
{
"file_name": "test.pdf",
"pages": 2,
"is_file_safety": false,
"unsafe_pages": "2",
"is_file_safety_advanced": true,
"unsafe_pages_advanced": "",
"results": [
{
"page": 1,
"is_page_safety": true,
"errors": [],
"is_page_safety_advanced": true,
"errors_advanced": [],
"metrics": {
"physical": {},
"images": [],
"vector": {},
"text": {}
},
"summary": {
"page_width_pt": 612.0,
"page_height_pt": 792.0,
"max_embedded_image_pixels": 0,
"vector_path_count": 58,
"raster_estimate_pixels_300dpi": 8415000
}
}
]
}
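Since reasons live on each page entry rather than at the file root, a caller that wants one overview has to gather them from `results`. The helper below is a sketch that works on a dict shaped like the payload above; `errors_by_page` is a hypothetical name.

```python
from typing import Any, Dict, List


def errors_by_page(report: Dict[str, Any], advanced: bool = False) -> Dict[int, List[str]]:
    """Map page number -> error list, keeping only pages that have errors."""
    key = "errors_advanced" if advanced else "errors"
    return {
        page["page"]: list(page[key])
        for page in report.get("results", [])
        if page[key]
    }


# Trimmed sample shaped like a file_analysis(...) result.
sample = {
    "file_name": "test.pdf",
    "results": [
        {"page": 1, "errors": [], "errors_advanced": []},
        {"page": 2, "errors": ["raster_estimate_too_big:77760000"], "errors_advanced": []},
    ],
}
print(errors_by_page(sample))                 # default model
print(errors_by_page(sample, advanced=True))  # advanced model
```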
2) page_analysis(file_path, page, config=None, json_response=False, timeout_seconds=None)
Runs a detailed scan of a single page (1-based index).
Returns (dict / JSON):
{
"file_name": "test.pdf",
"page": 2,
"is_page_safety": false,
"errors": [
"raster_estimate_too_big:77760000"
],
"is_page_safety_advanced": true,
"errors_advanced": [],
"metrics": {
"physical": {},
"images": [],
"vector": {},
"text": {}
},
"summary": {
"page_width_pt": 2592.0,
"page_height_pt": 1728.0,
"max_embedded_image_pixels": 354652,
"vector_path_count": 33035,
"raster_estimate_pixels_300dpi": 77760000
}
}
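The `raster_estimate_pixels_300dpi` values in both samples are consistent with converting the page size from points (1/72 inch) to pixels at 300 DPI and multiplying the two sides. The sketch below reproduces that arithmetic; the formula is inferred from the sample numbers, and the rounding step is an assumption rather than confirmed library behavior.

```python
def raster_estimate_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> int:
    """Estimate the pixel count of a page rendered at the given DPI."""
    # A PDF point is 1/72 inch, so pixels per side = points / 72 * dpi.
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px * height_px


print(raster_estimate_pixels(612.0, 792.0))    # 8415000  (2550 x 3300, US Letter)
print(raster_estimate_pixels(2592.0, 1728.0))  # 77760000 (10800 x 7200)
```

With the default max_raster_pixels of 30000000, the first page passes and the second does not, which matches the `raster_estimate_too_big:77760000` error shown above.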
If the page index is invalid, the method returns:
{
"file_name": "test.pdf",
"page": 999,
"is_page_safety": false,
"errors": ["invalid_page:999"],
"is_page_safety_advanced": false,
"errors_advanced": ["invalid_page:999"],
"metrics": { "physical": {}, "images": [], "vector": {}, "text": {} },
"summary": {
"page_width_pt": 0.0,
"page_height_pt": 0.0,
"max_embedded_image_pixels": 0,
"vector_path_count": 0,
"raster_estimate_pixels_300dpi": 0
}
}
page_analysis(...) also uses the default 30 second timeout unless you
override it with timeout_seconds. If the analysis exceeds the active timeout,
the page is returned as unsafe with a document_load_timeout:<seconds>s error.
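The error strings in this README follow a `code:detail` pattern, so a caller can separate the two parts and, for example, treat timeouts differently from size violations. This is a caller-side sketch: `split_error` and `is_timeout` are hypothetical helpers, and the `"30s"` detail used here is an assumed rendering of `<seconds>s`.

```python
from typing import Iterable, Optional, Tuple


def split_error(error: str) -> Tuple[str, Optional[str]]:
    """Split "code:detail" into (code, detail); detail is None if absent."""
    code, sep, detail = error.partition(":")
    return code, (detail if sep else None)


def is_timeout(errors: Iterable[str]) -> bool:
    """True if any error reports a document load timeout."""
    return any(split_error(e)[0] == "document_load_timeout" for e in errors)


print(split_error("raster_estimate_too_big:77760000"))
print(is_timeout(["document_load_timeout:30s"]))
```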
3) is_file_safe(file_path, config=None, json_response=False, timeout_seconds=None)
Convenience method that returns only the unsafe pages (default + advanced). Useful for fast checks or CLI output.
Returns (dict / JSON):
{
"file_name": "test.pdf",
"pages": 2,
"is_file_safety": false,
"unsafety_pages": [
{
"page": 2,
"errors": [
"raster_estimate_too_big:77760000"
]
}
],
"is_file_safety_advanced": true,
"unsafety_pages_advanced": []
}
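For CLI output, the two lists can be flattened into one line per unsafe page. The sketch below operates on a dict shaped like the payload above; `describe_unsafe` is a hypothetical helper. The sample shows an empty advanced list, so the key that advanced entries use for their reasons is an assumption; the code accepts either `errors` or `errors_advanced`.

```python
from typing import Any, Dict, List


def describe_unsafe(result: Dict[str, Any]) -> List[str]:
    """Build one human-readable line per unsafe page and model."""
    lines = []
    models = (("default", "unsafety_pages"), ("advanced", "unsafety_pages_advanced"))
    for model, key in models:
        for entry in result.get(key, []):
            # Assumption: advanced entries may carry "errors" or "errors_advanced".
            reasons = entry.get("errors") or entry.get("errors_advanced") or []
            text = ", ".join(reasons) or "no reason given"
            lines.append(f"{result['file_name']} page {entry['page']} [{model}]: {text}")
    return lines


sample = {
    "file_name": "test.pdf",
    "pages": 2,
    "is_file_safety": False,
    "unsafety_pages": [{"page": 2, "errors": ["raster_estimate_too_big:77760000"]}],
    "is_file_safety_advanced": True,
    "unsafety_pages_advanced": [],
}
for line in describe_unsafe(sample):
    print(line)
```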
4) is_page_safe(file_path, page, config=None, json_response=False, timeout_seconds=None)
Convenience method for a single page.
This currently returns the same structure as page_analysis(...).
Returns (dict / JSON):
{
"file_name": "test.pdf",
"page": 2,
"is_page_safety": false,
"errors": [
"raster_estimate_too_big:77760000"
],
"is_page_safety_advanced": true,
"errors_advanced": []
}
Configuration (Default Model Only)
You can override default safety thresholds per call. The advanced model is tuned and not intended to be overridden at runtime.
sentinel.is_file_safe(
"samples/test.pdf",
config={
"max_page_size": 1800,
"max_image_pixels": 10_000_000,
"max_vectors_operations": 1000,
"max_raster_pixels": 20_000_000,
"document_load_timeout_seconds": 30
}
)
PDF Sentinel now applies a default document load timeout automatically. You can also override it per call when needed:
sentinel.is_file_safe("samples/test.pdf", timeout_seconds=30)
sentinel.page_analysis("samples/test.pdf", 1, timeout_seconds=30)
| Parameter | Default | Description |
|---|---|---|
| max_page_size | 2000 | Max page dimension in points (pt) |
| max_image_pixels | 20000000 | Max pixels for a single embedded image |
| max_vectors_operations | 1500 | Max allowed vector drawing operations |
| max_raster_pixels | 30000000 | Max estimated raster size in pixels at 300 DPI |
| document_load_timeout_seconds | 30 | Max seconds allowed to open/analyze a PDF |
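Key names such as max_vectors_operations are easy to mistype, so it can help to build the config from the documented defaults and reject unknown keys before calling the library. The sketch below is caller-side only: `DEFAULT_THRESHOLDS` copies the table above, and `build_config` is a hypothetical helper, not part of pdfsentinel. It always returns a full dict, so it does not rely on how the library treats partial configs.

```python
from typing import Any, Dict

# Defaults copied from the table above.
DEFAULT_THRESHOLDS: Dict[str, Any] = {
    "max_page_size": 2000,
    "max_image_pixels": 20_000_000,
    "max_vectors_operations": 1500,
    "max_raster_pixels": 30_000_000,
    "document_load_timeout_seconds": 30,
}


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULT_THRESHOLDS)
    if unknown:
        raise KeyError(f"unknown threshold(s): {sorted(unknown)}")
    return {**DEFAULT_THRESHOLDS, **overrides}


config = build_config(max_page_size=1800, max_raster_pixels=20_000_000)
print(config)
```

The resulting dict can then be passed as the config argument, for example `sentinel.is_file_safe("samples/test.pdf", config=config)`.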
Advanced Safety Model (How it Works)
PDF Sentinel includes an advanced safety model that runs in parallel with the default rules.
While the default model focuses on conservative limits (page size, vector count, raster estimates), the advanced model is risk-based and tuned using real-world performance data from PDF rendering pipelines.
The advanced model flags a page as unsafe if any of the following conditions are met:
- Extreme physical size: Pages whose largest dimension exceeds a hard threshold are likely to cause excessive render time, regardless of content.
- Very wide pages: Unusually wide pages (common in blueprints and engineering drawings) tend to stress rasterization and memory allocation.
- Raster fan-out: Pages containing many embedded images with large combined pixel counts (and no soft masks) are strong indicators of memory pressure and CPU spikes during rendering.
Conceptually, the advanced decision is a simple OR gate over these risk signals:
dangerous = render_risk OR rss_risk
Where:
- render_risk is driven primarily by physical page dimensions
- rss_risk is driven by raster fan-out and total pixel pressure
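The OR gate can be illustrated with a small sketch. The real thresholds are not published in this README, so every constant below is an invented placeholder, and `PageSignals` is a hypothetical structure; only the shape of the decision (two independent risk signals combined with OR) comes from the description above.

```python
from dataclasses import dataclass

# Placeholder thresholds, invented for illustration only.
HARD_MAX_DIM_PT = 5000.0
WIDE_MAX_WIDTH_PT = 3000.0
FANOUT_MIN_IMAGES = 20
FANOUT_MAX_TOTAL_PIXELS = 100_000_000


@dataclass
class PageSignals:
    width_pt: float
    height_pt: float
    image_count: int
    total_image_pixels: int
    has_soft_masks: bool


def render_risk(p: PageSignals) -> bool:
    """Extreme physical size or a very wide page."""
    return max(p.width_pt, p.height_pt) > HARD_MAX_DIM_PT or p.width_pt > WIDE_MAX_WIDTH_PT


def rss_risk(p: PageSignals) -> bool:
    """Raster fan-out: many images, large combined pixels, no soft masks."""
    return (
        not p.has_soft_masks
        and p.image_count >= FANOUT_MIN_IMAGES
        and p.total_image_pixels > FANOUT_MAX_TOTAL_PIXELS
    )


def dangerous(p: PageSignals) -> bool:
    # Either signal alone is enough to flag the page.
    return render_risk(p) or rss_risk(p)


letter = PageSignals(612.0, 792.0, 0, 0, False)
print(dangerous(letter))
```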
The advanced model is intentionally not configurable at runtime. Its thresholds are pre-tuned and designed to be stable, predictable, and comparable across environments.
This makes it ideal for:
- Early rejection of pathological PDFs
- Protecting OCR and AI pipelines from worst-case inputs
- Fast, deterministic safety decisions at scale
You are free to rely on the default model, the advanced model, or both, depending on how strict your pipeline needs to be.
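One way to encode that choice is a small gate in front of the expensive step. The sketch below reads the two file-level verdicts from an is_file_safe(...) or file_analysis(...) result; `should_process` and the policy names are hypothetical, not part of the library.

```python
from typing import Any, Dict


def should_process(result: Dict[str, Any], policy: str = "both") -> bool:
    """Decide whether to continue, based on the chosen safety model(s)."""
    default_ok = bool(result["is_file_safety"])
    advanced_ok = bool(result["is_file_safety_advanced"])
    if policy == "default":
        return default_ok
    if policy == "advanced":
        return advanced_ok
    if policy == "both":
        # Strictest option: both models must consider the file safe.
        return default_ok and advanced_ok
    raise ValueError(f"unknown policy: {policy!r}")


# Verdicts taken from the sample payloads in this README.
verdicts = {"is_file_safety": False, "is_file_safety_advanced": True}
print(should_process(verdicts, "default"))
print(should_process(verdicts, "advanced"))
print(should_process(verdicts, "both"))
```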
License
MIT License © 2025 — Not Empty Foundation