Detect blank or content pages in PDFs using OCR and image analysis

Project description

Use Page Filter

Use Page Filter is a lightweight Python package that detects whether a page in a PDF contains useful content or should be considered blank.

It is designed for real-world document pipelines such as tax documents, scanned PDFs, and mixed digital/scanned files where pages may contain very small text, thin characters, shapes, or tables.
It can be used independently in your pipeline to avoid sending blank pages to VLLMs and similar downstream tasks.

The package prioritizes not missing real text, even if it is only a single character.

Features

Detect blank pages in PDFs
Works with scanned and digital PDFs
Detects very thin characters (e.g., A, S)
Handles pages with tables or shapes
Avoids OCR hallucinations from simple shapes
Multi-stage OCR detection pipeline
Lightweight and CPU-friendly

How It Works

Each page is processed through a progressive detection pipeline.
The system stops as soon as text is detected.
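The early-exit behavior can be sketched as follows. This is a minimal illustration, not the package's actual code: the stage names, `first_stage_with_text`, and the fake stage functions are all hypothetical stand-ins, and each stage is assumed to return `True` when it finds text.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stage type: a display name plus a function that reports
# whether it found text on the page.
Stage = Tuple[str, Callable[[object], bool]]


def first_stage_with_text(page: object, stages: Iterable[Stage]) -> Optional[str]:
    """Run stages in order; return the name of the first stage that finds
    text, or None if every stage comes back empty."""
    for name, found_text in stages:
        if found_text(page):
            return name  # early exit: later, more expensive stages are skipped
    return None


# Stand-in stages for illustration only. Real stages would call a PDF text
# extractor and an OCR engine.
calls = []


def fake_native(page):
    calls.append("native")
    return False


def fake_direct_ocr(page):
    calls.append("direct")
    return True


def fake_rotated_ocr(page):
    calls.append("rotated")
    return True


stages = [
    ("Native text", fake_native),
    ("OCR direct", fake_direct_ocr),
    ("OCR rotated", fake_rotated_ocr),
]
result = first_stage_with_text(None, stages)
print(result, calls)  # OCR direct ['native', 'direct']
```

Ordering the stages from cheapest to most expensive means most pages never reach the slower OCR variants.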

Processing Flow

1. Native PDF Text Extraction
   - If the page already contains digital text → CONTENT
2. Direct OCR
   - Run OCR on the rendered page
3. Rotated OCR
   - Try OCR with rotations: 90°, 180°, 270°
4. Scaled OCR
   - Downscale image (50%, 25%) and run OCR
5. Dilated + Scaled OCR
   - Strengthen thin strokes
   - Run OCR again
6. Shape Validation
   - Prevent shapes like boxes or lines from being interpreted as text
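To make the image transformations in the rotation, scaling, and dilation stages concrete, here is a pure-Python sketch on a tiny binary grid (1 = ink, 0 = background). It is an illustration under stated assumptions, not how the package is implemented: a real pipeline would apply equivalent operations to rendered page images with an imaging library, and the "any ink wins" downscaling rule is an assumption chosen here so that thin strokes survive.

```python
from typing import List

Grid = List[List[int]]  # 1 = ink, 0 = background


def rotate_90(grid: Grid) -> Grid:
    """Rotate the grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*grid[::-1])]


def downscale_half(grid: Grid) -> Grid:
    """Halve each dimension; a 2x2 block becomes ink if any pixel in it is ink."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            int(any(grid[2 * y + dy][2 * x + dx] for dy in (0, 1) for dx in (0, 1)))
            for x in range(w)
        ]
        for y in range(h)
    ]


def dilate(grid: Grid) -> Grid:
    """3x3 binary dilation: a pixel becomes ink if it or any neighbor is ink."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny in range(max(0, y - 1), min(h, y + 2)):
                for nx in range(max(0, x - 1), min(w, x + 2)):
                    if grid[ny][nx]:
                        out[y][x] = 1
    return out


# A one-pixel-wide vertical stroke in a 4x4 image.
thin = [[0, 1, 0, 0] for _ in range(4)]

thick = dilate(thin)           # stroke widens from 1 to 3 columns
small = downscale_half(thin)   # 2x2 image, stroke still present
turned = rotate_90(thin)       # stroke becomes horizontal

print(thick[0], small, turned[1])
```

Dilation is what "strengthen thin strokes" refers to: widening each stroke makes it less likely to vanish when the image is scaled down before OCR.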

Final Decision

If any stage detects a valid letter or number → CONTENT
Otherwise → BLANK
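A minimal sketch of this decision rule, assuming each stage yields a string of recognized text. The helper names `has_valid_char` and `decide` are hypothetical, and `str.isalnum` is used as a stand-in for "valid letter or number" (note that it also accepts non-ASCII letters and digits). Requiring an alphanumeric character is one simple way to ignore the punctuation-like output that OCR tends to produce for boxes and lines.

```python
from typing import Iterable


def has_valid_char(ocr_text: str) -> bool:
    """True if the text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in ocr_text)


def decide(stage_outputs: Iterable[str]) -> str:
    """Return "CONTENT" if any stage produced a letter or digit, else "BLANK"."""
    if any(has_valid_char(text) for text in stage_outputs):
        return "CONTENT"
    return "BLANK"


print(decide(["", "| _ - ="]))  # BLANK: only shape-like symbols
print(decide(["", "S"]))        # CONTENT: a single character is enough
```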

Installation

Install from PyPI

pip install use-page-filter

Install Tesseract OCR (Required)

Tesseract OCR must be installed separately.

Ubuntu: sudo apt install tesseract-ocr

Mac: brew install tesseract

Windows: https://github.com/UB-Mannheim/tesseract/wiki

Quick Example

from use_page_filter import process_pdf

results = process_pdf("document.pdf")

for page in results:
    print(page)

Example output:

Page 1  | CONTENT | Native text
Page 2  | BLANK   | Solid page
Page 3  | CONTENT | OCR scaled

Detect a Single Page

import fitz
from use_page_filter import detect_page
doc = fitz.open("document.pdf")

for page in doc:
    is_blank, reason, confidence = detect_page(page)
    print(is_blank, reason, confidence)
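Building on the loop above, a common next step is to collect the indices of non-blank pages. The sketch below takes the detector as a parameter so it runs without a PDF or Tesseract. The helper `content_page_indices` and the demo detector are hypothetical, and the `(is_blank, reason, confidence)` tuple shape is taken from the example above, with `confidence` assumed to be a float.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed detector shape, mirroring detect_page in the example above.
Detector = Callable[[object], Tuple[bool, str, float]]


def content_page_indices(pages: Iterable[object], detect: Detector) -> List[int]:
    """Return zero-based indices of pages the detector does not flag as blank."""
    keep = []
    for index, page in enumerate(pages):
        is_blank, _reason, _confidence = detect(page)
        if not is_blank:
            keep.append(index)
    return keep


# Demo detector for illustration: treats whitespace-only strings as blank pages.
def demo_detect(page):
    return (page.strip() == "", "demo", 1.0)


keep = content_page_indices(["Form 1040", "   ", "Schedule A"], demo_detect)
print(keep)  # [0, 2]
```

With a real document, the same helper could be called as `content_page_indices(doc, detect_page)`, and the resulting list could then be passed to PyMuPDF's `Document.select` to keep only those pages.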

Limitations

Very stylized fonts may not be detected.
Complex graphical pages may require additional heuristics.
OCR accuracy depends on the Tesseract engine.

Note: This package was made in a day for personal use.

Feel free to contribute

Project details

This version

0.1.0

May 6, 2026

Download files

Source Distribution

use_page_filter-0.1.0.tar.gz (4.5 kB)

Uploaded May 6, 2026 Source

Built Distribution

use_page_filter-0.1.0-py3-none-any.whl (5.2 kB)

Uploaded May 6, 2026 Python 3

File details

Details for the file use_page_filter-0.1.0.tar.gz.

File metadata

Download URL: use_page_filter-0.1.0.tar.gz
Upload date: May 6, 2026
Size: 4.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for use_page_filter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e6d4e168156a3b25eac4a22e010e2d56a496b52831e22a640d32a8927e82da6e`
MD5	`0164f3dd88c65d7fb90dbade2e172380`
BLAKE2b-256	`49de414623405f116493c6e453ef141800715411d467fa21689814e25fb3ffd3`

File details

Details for the file use_page_filter-0.1.0-py3-none-any.whl.

File metadata

Download URL: use_page_filter-0.1.0-py3-none-any.whl
Upload date: May 6, 2026
Size: 5.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for use_page_filter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8ee656660d4a848f349cb293e14b5c9e651c93eecd9117ff18354ab11edb614a`
MD5	`f024b36780f80c89161cfc1d838421e9`
BLAKE2b-256	`2ac72b2769ef7a32877edb27de64b7ed2d41de45b3bf10e8be659edb192aab73`
