
Detect blank or content pages in PDFs using OCR and image analysis


Use Page Filter

Use Page Filter is a lightweight Python package that detects whether a page in a PDF contains useful content or should be considered blank.

It is designed for real-world document pipelines such as tax documents, scanned PDFs, and mixed digital/scanned files where pages may contain very small text, thin characters, shapes, or tables.
It can run as an independent step in your pipeline, cutting down on unnecessary VLLM calls and similar downstream work.

The package prioritizes not missing real text, even if it is only a single character.


Features

  • Detect blank pages in PDFs
  • Works with scanned and digital PDFs
  • Detects very thin characters (e.g., A, S)
  • Handles pages with tables or shapes
  • Avoids OCR hallucinations from simple shapes
  • Multi-stage OCR detection pipeline
  • Lightweight and CPU-friendly

How It Works

Each page is processed through a progressive detection pipeline.
The system stops as soon as text is detected.

Processing Flow

  1. Native PDF Text Extraction

    • If the page already contains digital text → CONTENT
  2. Direct OCR

    • Run OCR on the rendered page
  3. Rotated OCR

    • Try OCR with rotations: 90°, 180°, 270°
  4. Scaled OCR

    • Downscale image (50%, 25%) and run OCR
  5. Dilated + Scaled OCR

    • Strengthen thin strokes
    • Run OCR again
  6. Shape Validation

    • Prevent shapes like boxes or lines from being interpreted as text
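
The dilation and downscaling in steps 4 and 5 can be pictured on a tiny binary image. The sketch below is only an illustration under stated assumptions: it uses plain Python lists (1 = ink, 0 = background) and the hypothetical helpers `dilate` and `downscale_half`. It does not show how the package itself implements these stages, which this description does not specify.

```python
from typing import List

Grid = List[List[int]]  # 1 = ink, 0 = background (illustrative representation)


def dilate(grid: Grid) -> Grid:
    """3x3 dilation: every ink pixel also marks its 8 neighbours as ink."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 1
    return out


def downscale_half(grid: Grid) -> Grid:
    """Halve the resolution; an output pixel is ink if any pixel in its 2x2 block is ink."""
    h, w = len(grid), len(grid[0])
    return [
        [
            int(any(
                grid[yy][xx]
                for yy in range(y, min(y + 2, h))
                for xx in range(x, min(x + 2, w))
            ))
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


# A one-pixel-wide vertical stroke becomes three pixels wide after dilation.
thin_stroke = [[0, 0, 1, 0, 0] for _ in range(5)]
thick_stroke = dilate(thin_stroke)
print(thick_stroke[0])  # [0, 1, 1, 1, 0]
```

Thickening a stroke before shrinking the image helps it survive the downscale, which is presumably why the two operations are combined in step 5.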

Final Decision

  • If any stage detects a valid letter or number → CONTENT
  • Otherwise → BLANK
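
The early-exit behaviour described above can be sketched as a loop over stages. This is a minimal illustration, not the package's internal code: `classify_page`, `has_letter_or_digit`, and the stand-in stage functions are hypothetical names, and each stage is assumed to be a callable that returns whatever text it found on the page.

```python
from typing import Callable, Iterable, Tuple

# (stage name, function that returns the text it found on the page)
Stage = Tuple[str, Callable[[object], str]]


def has_letter_or_digit(text: str) -> bool:
    """True if the text contains at least one letter or number."""
    return any(ch.isalnum() for ch in text)


def classify_page(page: object, stages: Iterable[Stage]) -> Tuple[str, str]:
    """Run stages in order and stop at the first one that finds real text."""
    for name, extract_text in stages:
        if has_letter_or_digit(extract_text(page)):
            return "CONTENT", name
    return "BLANK", "No text detected"


# Stand-in stages: the first two find nothing usable, the third finds a letter.
stages = [
    ("Native text", lambda page: ""),
    ("OCR direct", lambda page: "| _ |"),
    ("OCR scaled", lambda page: "A"),
]
print(classify_page(None, stages))  # ('CONTENT', 'OCR scaled')
```

Ordering the stages from cheapest to most expensive means most pages with native text never reach OCR at all.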

Installation

Install from PyPI

pip install use-page-filter

Install Tesseract OCR (Required)

Tesseract OCR must be installed separately.

Ubuntu: sudo apt install tesseract-ocr

Mac: brew install tesseract

Windows: https://github.com/UB-Mannheim/tesseract/wiki
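
Before running the OCR stages, it can help to confirm that the Tesseract executable is actually reachable. The check below is a small sketch using only the standard library. `tesseract_available` is a hypothetical helper, not part of this package, and it assumes the binary is named `tesseract` and is on your PATH.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; install it before running the OCR stages.")
```

On Windows you may need to add the Tesseract install directory to PATH for this check to succeed.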

Quick Example

from use_page_filter import process_pdf

# Classify every page in the PDF
results = process_pdf("document.pdf")

for page in results:
    print(page)

Example output:

Page 1  | CONTENT | Native text
Page 2  | BLANK   | Solid page
Page 3  | CONTENT | OCR scaled

Detect a Single Page

import fitz  # PyMuPDF
from use_page_filter import detect_page

doc = fitz.open("document.pdf")

for page in doc:
    # detect_page returns an (is_blank, reason, confidence) tuple
    is_blank, reason, confidence = detect_page(page)
    print(is_blank, reason, confidence)
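
A common follow-up is to collect the indices of the pages worth keeping. The helper below is a hedged sketch: `content_page_indices` is a hypothetical name, and the detector is passed in as an argument so the example runs without a PDF. It assumes only what the loop above shows, namely that the detector returns an `(is_blank, reason, confidence)` tuple. The `float` type for confidence is a guess.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed shape of detect_page's return value: (is_blank, reason, confidence)
Detector = Callable[[object], Tuple[bool, str, float]]


def content_page_indices(pages: Iterable[object], detect: Detector) -> List[int]:
    """Return zero-based indices of pages the detector does not consider blank."""
    keep = []
    for index, page in enumerate(pages):
        is_blank, _reason, _confidence = detect(page)
        if not is_blank:
            keep.append(index)
    return keep


# Stand-in detector for demonstration: treats empty strings as blank pages.
def fake_detect(page: object) -> Tuple[bool, str, float]:
    return (page == "", "stub", 1.0)


print(content_page_indices(["text", "", "more"], fake_detect))  # [0, 2]
```

With a real document you would call `content_page_indices(doc, detect_page)`. PyMuPDF's `doc.select(indices)` can then reduce the document to those pages before you save it under a new file name.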

Limitations

  • Very stylized fonts may not be detected.
  • Complex graphical pages may require additional heuristics.
  • OCR accuracy depends on the Tesseract engine.

Note: This was made in a day for personal use.

Feel free to contribute
