Detect blank or content pages in PDFs using OCR and image analysis
Use Page Filter
Use Page Filter is a lightweight Python package that detects whether a page in a PDF contains useful content or should be considered blank.
It is designed for real-world document pipelines such as tax documents, scanned PDFs, and mixed digital/scanned files where pages may contain very small text, thin characters, shapes, or tables.
It can run as a standalone step in your pipeline, reducing unnecessary VLLM calls and similar downstream work on blank pages.
The package prioritizes not missing real text, even if it is only a single character.
Features
- Detect blank pages in PDFs
- Works with scanned and digital PDFs
- Detects very thin characters (e.g., A, S)
- Handles pages with tables or shapes
- Avoids OCR hallucinations from simple shapes
- Multi-stage OCR detection pipeline
- Lightweight and CPU-friendly
How It Works
Each page is processed through a progressive detection pipeline.
The system stops as soon as text is detected.
Processing Flow
- Native PDF Text Extraction
  - If the page already contains digital text → CONTENT
- Direct OCR
  - Run OCR on the rendered page
- Rotated OCR
  - Try OCR with rotations: 90°, 180°, 270°
- Scaled OCR
  - Downscale image (50%, 25%) and run OCR
- Dilated + Scaled OCR
  - Strengthen thin strokes
  - Run OCR again
- Shape Validation
  - Prevent shapes like boxes or lines from being interpreted as text
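The "strengthen thin strokes" step is not described in more detail here. One common way to thicken strokes is morphological dilation, so the following is a minimal pure-Python sketch of that idea on a binary grid. It is not the package's actual implementation; the function name `dilate` and the 3x3 neighbourhood are assumptions made for illustration.

```python
def dilate(grid):
    """Return a copy of a binary image in which every ink pixel (1) also
    inks its 8 neighbours, thickening strokes by one pixel on each side."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not grid[y][x]:
                continue
            # Spread the ink pixel over its 3x3 neighbourhood, clipped to the image.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        out[ny][nx] = 1
    return out


# A one-pixel-wide vertical stroke becomes three pixels wide.
thin = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
thick = dilate(thin)
```

On real page images this would normally be done with an image library (for example OpenCV's `cv2.dilate`) rather than nested loops; the loop version only shows what the operation does.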
Final Decision
- If any stage detects a valid letter or number → CONTENT
- Otherwise → BLANK
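The early-exit behaviour and the final decision rule can be sketched as follows. This is an illustration under assumptions, not the package's code: `classify_page`, `has_valid_character`, the stage callables, and the reason string `"No text detected"` are all hypothetical names, and the stages here are stubs that return fixed text instead of running OCR.

```python
def has_valid_character(text):
    """True if the text contains at least one letter or digit.
    str.isalnum is Unicode-aware, so non-ASCII letters also count."""
    return any(ch.isalnum() for ch in (text or ""))


def classify_page(page, stages):
    """Run (name, stage) pairs in order and stop at the first stage whose
    output contains a letter or digit. Returns (label, reason)."""
    for name, stage in stages:
        if has_valid_character(stage(page)):
            return "CONTENT", name
    return "BLANK", "No text detected"


# Stub stages that record when they are called, to show the early exit.
calls = []

def make_stage(name, text):
    def stage(page):
        calls.append(name)
        return text
    return stage

stages = [
    ("Native text", make_stage("Native text", "")),
    ("OCR direct", make_stage("OCR direct", "|_ -")),  # shape-like noise only
    ("OCR rotated", make_stage("OCR rotated", "A")),   # a single character is enough
    ("OCR scaled", make_stage("OCR scaled", "never reached")),
]
result = classify_page(object(), stages)
```

Ordering the stages from cheapest to most expensive means that most pages with digital text never reach OCR at all.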
Installation
Install from PyPI
pip install use-page-filter
Install Tesseract OCR (Required)
Tesseract OCR must be installed separately.
Ubuntu:
sudo apt install tesseract-ocr
Mac:
brew install tesseract
Windows: https://github.com/UB-Mannheim/tesseract/wiki
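Because Tesseract is installed separately, it can help to check that the executable is reachable before processing any PDFs. A minimal sketch using only the standard library; the helper name `tesseract_available` is an assumption, and on Windows the install directory may need to be added to PATH manually for this check to pass.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; install it before running OCR.")
```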
Quick Example
from use_page_filter import process_pdf
results = process_pdf("document.pdf")
for page in results:
    print(page)
Example output:
Page 1 | CONTENT | Native text
Page 2 | BLANK | Solid page
Page 3 | CONTENT | OCR scaled
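If only the printed lines are available (for example from a log), they can be parsed back into structured values. The sketch below assumes every line follows the `Page N | STATUS | reason` layout shown in the example output; the actual type returned by `process_pdf` is not documented here, and `parse_result_line` is a hypothetical helper.

```python
def parse_result_line(line):
    """Split 'Page 2 | BLANK | Solid page' into (2, 'BLANK', 'Solid page')."""
    page_part, status, reason = (part.strip() for part in line.split("|", 2))
    page_number = int(page_part.split()[1])  # 'Page 2' -> 2
    return page_number, status, reason


lines = [
    "Page 1 | CONTENT | Native text",
    "Page 2 | BLANK | Solid page",
    "Page 3 | CONTENT | OCR scaled",
]
parsed = [parse_result_line(line) for line in lines]
blank_pages = [number for number, status, _ in parsed if status == "BLANK"]
print(blank_pages)  # [2]
```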
Detect a Single Page
import fitz  # PyMuPDF
from use_page_filter import detect_page
doc = fitz.open("document.pdf")
for page in doc:
    # detect_page returns a (is_blank, reason, confidence) tuple for each page
    is_blank, reason, confidence = detect_page(page)
    print(is_blank, reason, confidence)
```
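A typical follow-up is to collect the pages worth keeping. The sketch below takes any detector that returns `(is_blank, reason, confidence)`, matching the tuple unpacked in the loop above, and returns the zero-based indices of non-blank pages. `content_page_indices` and `fake_detect` are hypothetical names; the stub detector stands in for `detect_page` so the example runs without a PDF.

```python
def content_page_indices(pages, detect):
    """Return zero-based indices of pages that `detect` does not flag as blank.
    `detect` must return an (is_blank, reason, confidence) tuple."""
    keep = []
    for index, page in enumerate(pages):
        is_blank, _reason, _confidence = detect(page)
        if not is_blank:
            keep.append(index)
    return keep


# Stand-in detector for demonstration: treats empty strings as blank pages.
def fake_detect(page):
    is_blank = page == ""
    return is_blank, "stub", 1.0


indices = content_page_indices(["text", "", "more"], fake_detect)
print(indices)  # [0, 2]
```

With the real library this would presumably be called as `content_page_indices(doc, detect_page)`. PyMuPDF's `Document.select(indices)` followed by `doc.save(...)` could then write a copy of the PDF that contains only those pages.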
Limitations
- Very stylized fonts may not be detected.
- Complex graphical pages may require additional heuristics.
- OCR accuracy depends on the Tesseract engine.
Note: This package was made in a day for personal use.
Contributions are welcome.