A Python script that runs Paddle OCR on a possibly unsearchable PDF to make it searchable.

Project description

PdfOCRer

PdfOCRer is a Python script that runs OCR on an input PDF (possibly unsearchable) and produces a searchable PDF.

It uses Paddle OCR as the OCR engine. Compared with the well-known OCR engine Tesseract, Paddle OCR gave better results for Chinese text in my experiments. However, Paddle OCR does not appear to offer a way to generate a searchable PDF from its OCR results, whereas Tesseract does. If that is what you need, PdfOCRer can help.

PdfOCRer processes a PDF in four steps:

  • Step 1: convert each PDF page into an image (default format: PNG).
  • Step 2: run OCR on each image to recognize the text and its bounding boxes.
  • Step 3: place the recognized text, positioned by its bounding boxes, over the image as a hidden layer to make a one-page PDF.
  • Step 4: merge all one-page PDFs into the final searchable PDF.
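
Step 3 depends on mapping each OCR bounding box from image pixels to PDF page coordinates. The sketch below is not taken from PdfOCRer's source. It assumes the box arrives as four (x, y) corner points in pixels with the origin at the top-left of the image, that the page image was rendered at a known DPI, and that the PDF page is measured in points (1/72 inch) with the origin at the bottom-left. The function name bbox_to_pdf_rect and the 300 DPI default are hypothetical.

```python
def bbox_to_pdf_rect(corners, image_height_px, dpi=300):
    """Convert a pixel-space OCR box to (left, bottom, width, height) in PDF points.

    corners: four (x, y) pixel points, origin at the top-left of the image (assumed).
    image_height_px: height of the rendered page image in pixels.
    dpi: resolution the page was rendered at (assumed value, not PdfOCRer's setting).
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]

    # One pixel is 72 / dpi points.
    left = min(xs) * 72.0 / dpi
    width = (max(xs) - min(xs)) * 72.0 / dpi
    height = (max(ys) - min(ys)) * 72.0 / dpi

    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    bottom = (image_height_px - max(ys)) * 72.0 / dpi
    return left, bottom, width, height
```

A PDF library could then draw the recognized string invisibly at (left, bottom), scaled to fit the width, so that the text can be selected and searched without being seen.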

Dependencies

The script has been tested with Python 3.10. It should work with other versions as long as the dependencies do.

  • Paddle OCR and PaddlePaddle: follow the official Paddle OCR installation instructions.

  • PyPDF2, Pillow, ReportLab: pip install pillow PyPDF2 reportlab

  • Ghostscript: used to convert PDF pages to images; the script calls it as a subprocess.

    -- on Linux/Ubuntu: apt-get install ghostscript

    -- on macOS: brew install ghostscript

    -- on Windows: download the 32-bit or 64-bit installer (exe) and run it.
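
Because Ghostscript is not installed by pip, it can help to check for it before running the script. The following is a minimal sketch, not PdfOCRer's own code: it looks for the executable names Ghostscript commonly installs under and builds a command that renders each page to PNG. The helper names, the 300 DPI default, and the exact flag set are assumptions; the flags PdfOCRer actually passes may differ.

```python
import shutil

# Executable names Ghostscript commonly installs under (Unix first, then Windows).
GS_CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(candidates=GS_CANDIDATES):
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_gs_command(gs_path, input_pdf, output_pattern, dpi=300):
    """Build an argument list that renders every page of input_pdf to a PNG file.

    output_pattern should contain a page counter such as 'page_%03d.png'.
    """
    return [
        gs_path,
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-q",  # run non-interactively and quietly
        "-sDEVICE=png16m",                        # 24-bit colour PNG output
        f"-r{dpi}",                               # rendering resolution
        f"-sOutputFile={output_pattern}",
        input_pdf,
    ]
```

If find_ghostscript() returns a path, the list from build_gs_command can be passed to subprocess.run(cmd, check=True); if it returns None, Ghostscript still needs to be installed as described above.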

Installation

Run the following on the command line:

pip install pdfocrer

This installs all dependencies except Ghostscript.

How to use

To use the script from the command line, run it as follows:

python pdf_ocrer.py -i <input_pdf> -o <output_pdf> -l <language> -t <temp_dir> [--debug]

Example:

python pdf_ocrer.py -i ../example/scanned_page.pdf -o ../example/scanned_page.ocr.pdf -t ../temp -l ch --debug
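
To drive the command-line script from another program, the same flags can be assembled into an argument list. This is only a sketch built from the options shown above (-i, -o, -l, -t, --debug); the helper name build_ocr_command and the default script path are hypothetical.

```python
import sys


def build_ocr_command(input_pdf, output_pdf, language, temp_dir,
                      debug=False, script="pdf_ocrer.py"):
    """Return the argument list for one pdf_ocrer.py run, using the current interpreter."""
    cmd = [
        sys.executable, script,
        "-i", str(input_pdf),
        "-o", str(output_pdf),
        "-l", language,
        "-t", str(temp_dir),
    ]
    if debug:
        cmd.append("--debug")
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting problems with paths that contain spaces.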

To use it from Python code, do something like this:

from pdfocrer.pdf_ocrer import PdfOCRer

input_pdf = './example/scanned_page.pdf'
output_pdf = './example/scanned_page.ocr.pdf'
is_debug = True
temp_dir = './temp'
language = 'ch'  # or 'en', 'korean', 'japan', 'latin', 'arabic', etc.

# Constructor arguments: the debug flag, then the temporary directory.
ocrer = PdfOCRer(is_debug, temp_dir)

# Run OCR on the input PDF and write the searchable PDF to output_pdf.
ocrer.process_pdf(input_pdf, output_pdf, language)
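
The same call can be applied to a whole folder of scans. The sketch below assumes only the process_pdf(input, output, language) call shown in the example, and that one PdfOCRer instance can be reused for several files, which the project does not state. The helper name ocr_directory and the .ocr.pdf naming rule (borrowed from the example paths) are illustrative.

```python
from pathlib import Path


def ocr_directory(ocrer, input_dir, language):
    """Call ocrer.process_pdf for every PDF in input_dir that has no .ocr.pdf yet.

    ocrer: any object with a process_pdf(input_path, output_path, language) method.
    Returns the list of output paths that were requested.
    """
    written = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        if pdf.name.endswith(".ocr.pdf"):
            continue  # this file is itself the output of an earlier run
        target = pdf.with_suffix(".ocr.pdf")  # scanned_page.pdf -> scanned_page.ocr.pdf
        if target.exists():
            continue  # already processed
        ocrer.process_pdf(str(pdf), str(target), language)
        written.append(target)
    return written
```

With the package installed, this could be called as ocr_directory(PdfOCRer(False, './temp'), './example', 'ch').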

Acknowledgement

Thanks to the authors of all the dependency libraries, and to the Python and open-source communities.

Project details


Release history

This version

1.2

Download files

Source Distribution

pdfocrer-1.2.tar.gz (7.9 kB view details)

Uploaded Source

Built Distribution

PdfOCRer-1.2-py3-none-any.whl (13.4 kB view details)

Uploaded Python 3

File details

Details for the file pdfocrer-1.2.tar.gz.

File metadata

  • Download URL: pdfocrer-1.2.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.15

File hashes

Hashes for pdfocrer-1.2.tar.gz
Algorithm Hash digest
SHA256 50cc5d9180b2d899c227dc2b1868b096d050cab3c5eea42856fd1fb181b4b883
MD5 1522132e872317304cdf249161a49e92
BLAKE2b-256 d1b46dcba64f9a0b47fcbcc8f07f3d0ddce2a7c138dd63d20e56232af719788b
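
A downloaded file can be checked against the SHA256 digest listed above before installing it. This is a generic sketch using Python's hashlib; the helper names are hypothetical and nothing here is specific to PdfOCRer.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """True if the file's SHA256 digest equals expected_hex (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_sha256('pdfocrer-1.2.tar.gz', '<digest copied from the table>') should return True for an intact download.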

File details

Details for the file PdfOCRer-1.2-py3-none-any.whl.

File metadata

  • Download URL: PdfOCRer-1.2-py3-none-any.whl
  • Upload date:
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.15

File hashes

Hashes for PdfOCRer-1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 da99d2fc2f1c58d56b4f821e7f12ff300254e941b9e353d97413a0b17e7194d1
MD5 27c8e4d0bbf94337d604f1e4037652cf
BLAKE2b-256 fd40e41bfefdd95b566c006e3976d718f61365ac58bfd66c15025cc1c482aaf7
