handle_scanned_pdf

A wrapper around pytesseract, the Python OCR tool that uses Google’s Tesseract-OCR Engine to recognize and extract text embedded in images.

Source code can be accessed here: sxaxmz/handdle_scanned_pdf

Install the package using:

$ pip install handle-scanned-pdf

Only if required, set the path to the Tesseract executable:

import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract/tesseract.exe'
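
One way to decide whether the override is required at all is to check if a Tesseract executable is already on the PATH. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper and is not part of handle_scanned_pdf or pytesseract.

```python
import shutil
from typing import Optional, Sequence


def find_tesseract(candidates: Sequence[str] = ("tesseract",)) -> Optional[str]:
    """Return the first matching executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; set tesseract_cmd manually.")
else:
    print(f"Tesseract found at {tesseract_path}; an override is probably not needed.")
```

If nothing is found, point `pytesseract.pytesseract.tesseract_cmd` at the location where Tesseract is installed on your machine.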

Tesseract-OCR supports:

  • Various image types including (but not limited to) jpeg, png, gif, bmp, tiff.
  • A wide range of languages (see the list of languages).

Server Installation

$ apt install tesseract-ocr
$ apt-get install poppler-utils

Language Support

Make sure to install the Tesseract-OCR language data for each language you need.

Installation on Linux:

$ apt install tesseract-ocr-<language-code>

Download for Windows (set the path to the downloaded language files):

  • Download the language files
  • Add the folder that contains the downloaded files to the system environment variables as TESSDATA_PREFIX
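
To confirm that a language is available before running OCR, you can look for its data file. This is a minimal sketch that assumes TESSDATA_PREFIX points at the folder holding the `<language-code>.traineddata` files, as described above; some Tesseract versions expect the variable to point at the parent folder instead. `has_language` is a hypothetical helper, not part of the package.

```python
import os
from pathlib import Path
from typing import Optional


def has_language(lang_code: str, tessdata_dir: Optional[str] = None) -> bool:
    """Check whether <lang_code>.traineddata exists in the language data folder."""
    # Assumption: fall back to TESSDATA_PREFIX when no folder is given.
    folder = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not folder:
        return False
    return (Path(folder) / f"{lang_code}.traineddata").is_file()


print(has_language("ara"))
```

On Linux, running `tesseract --list-langs` in a terminal lists the installed languages.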

Packages Required (src: requirements.txt):

pytesseract===0.3.10
pdf2image===1.17.0
PyPDF2===3.0.1
opencv-python
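
To compare your environment against the list above, the standard library can report which versions are installed. This is a sketch only; `installed_versions` is a hypothetical helper, and the package names are the ones listed in requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["pytesseract", "pdf2image", "PyPDF2", "opencv-python"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'not installed'}")
```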

Easy-to-Use:

  • Straightforward functions.
  • Customizable process.
  • JSON output.

Draw bounding boxes on the text that can be extracted from PDF

import os

import cv2
import numpy as np

from handle_scanned_pdf import draw_bounding_boxes

img_path = 'sample__images/3ba4c1f1-775f-4e05-ab48-a40617087a57-1.png'
img = np.array(cv2.imread(img_path))  # Read the image as a NumPy array
output_path = 'output'
file_name = os.path.basename(img_path).split('.')[0]  # File name without extension
pageNum = 0
draw_bounding_boxes(img, output_path, file_name, pageNum)

Output:

output/images_bounding/3ba4c1f1-775f-4e05-ab48-a40617087a57-1_bounding_images/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg

Get text in Bulk from Multiple PDF files

from handle_scanned_pdf import get_pdf_text_bulk_pdf

pdf_folder_path = 'pdf_files'
output_path = 'output'
lang_code = 'ara'
draw_boxes = True
get_pdf_text_bulk_pdf(pdf_folder_path, output_path, lang_code, draw_boxes)

Output:

{'number_of_files': 1,
 'txt_file_path_bulk': ['output/sample_.pdf'],
 'bounding_img_path': ['output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_1.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_2.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_3.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_4.jpg']}

Get text from a single PDF file

from handle_scanned_pdf import get_pdf_text

pdf_path_ = 'sample_.pdf'
output_path = 'output'
lang_code = 'ara'
draw_boxes = True
get_pdf_text(pdf_path_, output_path, lang_code, draw_boxes)

Output:

{'bounding_img_path': ['output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_2.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_3.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_4.jpg'],
 'txt_file_path': 'output/sample_.txt'}
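
The returned dictionary can be consumed directly. The sketch below works on a copy of the sample output above (shortened to two images) rather than calling the package; `summarize_result` is a hypothetical helper, and UTF-8 is an assumed encoding for the text file.

```python
import json
from pathlib import Path

# Copied from the sample output above, shortened to two images.
result = {
    "bounding_img_path": [
        "output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg",
        "output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg",
    ],
    "txt_file_path": "output/sample_.txt",
}


def summarize_result(result):
    """Summarize one result: text file, character count, boxed pages."""
    txt_path = Path(result["txt_file_path"])
    available = txt_path.is_file()
    # Assumption: the extracted text file is UTF-8 encoded.
    text = txt_path.read_text(encoding="utf-8") if available else ""
    return {
        "txt_file": result["txt_file_path"],
        "text_available": available,
        "characters": len(text),
        "boxed_pages": len(result["bounding_img_path"]),
    }


summary = summarize_result(result)
print(json.dumps(summary, indent=2))
```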

Extract text, draw bounding boxes, and convert PDF file to text searchable PDF

from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf

file_pdf = 'sample_.pdf'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = 'ara'
image_converted_format = 'png'
get_text = True
draw_boxes = False
scanned_pdf_to_text_searchable_pdf(file_pdf, output_folder_path_img, output_path, lang_code, image_converted_format, get_text, draw_boxes)

Output:

{'file_name': 'sample_',
 'img_path': 'output/img/sample__images',
 'pdf_path': 'output/searchable_pdf_sample_.pdf',
 'number_of_pages': 5,
 'text_file': {'bounding_img_path': [],
  'txt_file_path': 'output/sample_.txt'}}

Extract text, draw bounding boxes, and convert PDF file to text searchable PDF in Bulk

from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf_bulk

pdf_folder_path = 'pdf_files'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = 'ara'
image_converted_format = 'png'
get_text = True
draw_boxes = False
scanned_pdf_to_text_searchable_pdf_bulk(pdf_folder_path, output_folder_path_img, output_path, lang_code, image_converted_format, get_text, draw_boxes)

Output:

{'number_files_converted': 1,
 'files_details': [{'file_name': 'sample_',
   'img_path': 'output/img/sample__images',
   'pdf_path': 'output/searchable_pdf_sample_.pdf',
   'number_of_pages': 5,
   'text_file': {'bounding_img_path': [],
    'txt_file_path': 'output/sample_.txt'}}]}
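
The bulk result nests one entry per converted file under 'files_details'. The sketch below walks a copy of the sample output above; `total_pages` and `searchable_pdfs` are hypothetical helpers and not part of the package.

```python
# Copied from the sample output above.
bulk_result = {
    "number_files_converted": 1,
    "files_details": [
        {
            "file_name": "sample_",
            "img_path": "output/img/sample__images",
            "pdf_path": "output/searchable_pdf_sample_.pdf",
            "number_of_pages": 5,
            "text_file": {
                "bounding_img_path": [],
                "txt_file_path": "output/sample_.txt",
            },
        }
    ],
}


def total_pages(bulk_result):
    """Sum the page counts of all converted files."""
    return sum(d["number_of_pages"] for d in bulk_result["files_details"])


def searchable_pdfs(bulk_result):
    """Collect the paths of the generated searchable PDFs."""
    return [d["pdf_path"] for d in bulk_result["files_details"]]


print(total_pages(bulk_result))
print(searchable_pdfs(bulk_result))
```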
