handle_scanned_pdf

A wrapper around Python OCR tools such as pytesseract and easyocr that recognizes and extracts text embedded in images.

Source code is available at sxaxmz/handdle_scanned_pdf

Install the package using:

$ pip install handle-scanned-pdf

Features:

  • Convert scanned PDFs to text-searchable PDFs (end-to-end).
  • Extract text from scanned PDFs and images.
  • Draw bounding boxes around the text that can be extracted on scanned PDFs and images.

Usage:

  • Make scanned documents searchable and parsable.
  • Helpful in digitizing archives.
  • Prepare scanned documents and images for use in RAG (retrieval-augmented generation) applications.

Tesseract-OCR supports:
  • Various image types including (but not limited to) jpeg, png, gif, bmp, tiff.
  • A wide range of languages (see the list of languages).
  • Reading more than one language at a time.

Server Installation

$ apt install tesseract-ocr
$ apt-get install poppler-utils

Set the path to the Tesseract executable only if it is not found automatically:

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'  # on Windows, point to tesseract.exe instead
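
Before running OCR with Tesseract, it can help to confirm that the system tools installed above are reachable on PATH. The following is a minimal sketch and not part of handle_scanned_pdf: the helper name `find_missing_tools` is hypothetical, and it assumes that poppler-utils provides the `pdftoppm` executable.

```python
import shutil

# Executables assumed to come from the system packages above:
# tesseract-ocr -> "tesseract", poppler-utils -> "pdftoppm".
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All required system tools were found.")
```

If `tesseract` is reported as missing even though it is installed, that is the case where setting `tesseract_cmd` explicitly is likely needed.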
EasyOCR supports:
  • Use without a server installation.
  • Reading more than one language at a time.
  • Faster performance on a GPU.
  • Many languages (see the list of supported language codes).
Language Support
Tesseract
  • Make sure to install the Tesseract-OCR language data for each language you need.

Installation on Linux:

$ apt install tesseract-ocr-<language-code>

Download for Windows (set the path to the downloaded language data):

  • Download the language files.
  • Add the folder that contains the downloaded files to the system environment variables as TESSDATA_PREFIX.
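
To check that the language files are in place before running OCR, something like the sketch below may be useful. It is not part of handle_scanned_pdf: the function name `missing_traineddata` is hypothetical, and it assumes that TESSDATA_PREFIX points directly at the folder holding the `<language-code>.traineddata` files.

```python
import os
from pathlib import Path


def missing_traineddata(lang_code, tessdata_dir=None):
    """Return the codes from e.g. "eng+ara" that have no .traineddata file."""
    # Fall back to TESSDATA_PREFIX when no folder is passed explicitly.
    folder = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not folder:
        raise ValueError("Pass tessdata_dir or set TESSDATA_PREFIX")
    folder = Path(folder)
    codes = [code for code in lang_code.split("+") if code]
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]
```

An empty list means every requested language has a matching file in the folder.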

Defining the language codes for Tesseract:

lang_code = "eng+ara"
txt_extract_lang_code = "eng+ara"
EasyOCR

Defining the language codes for EasyOCR:

lang_code = ["en","ar"]
txt_extract_lang_code = ["en","ar"]
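
Because the two engines expect different formats (a "+"-joined string for Tesseract, a list for EasyOCR), a small conversion helper can keep both settings in sync. This is a sketch under stated assumptions: the function names and the lookup table are hypothetical, and the table covers only the two languages used in this README.

```python
# Hypothetical lookup table; extend it with the languages you actually need.
EASYOCR_TO_TESSERACT = {"en": "eng", "ar": "ara"}
TESSERACT_TO_EASYOCR = {v: k for k, v in EASYOCR_TO_TESSERACT.items()}


def to_tesseract_lang(easyocr_codes):
    """Convert ["en", "ar"] into the "eng+ara" form used by Tesseract."""
    try:
        return "+".join(EASYOCR_TO_TESSERACT[code] for code in easyocr_codes)
    except KeyError as exc:
        raise ValueError(f"No Tesseract code known for {exc.args[0]!r}") from exc


def to_easyocr_langs(tesseract_code):
    """Convert "eng+ara" into the ["en", "ar"] form used by EasyOCR."""
    try:
        return [TESSERACT_TO_EASYOCR[code] for code in tesseract_code.split("+")]
    except KeyError as exc:
        raise ValueError(f"No EasyOCR code known for {exc.args[0]!r}") from exc
```
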
Packages Required (src: requirements.txt):
pytesseract===0.3.10
pdf2image===1.17.0
PyPDF2===3.0.1
opencv-python

Easy-to-Use:

  • Straightforward functions.
  • Customizable process.
  • JSON output.

Draw bounding boxes on the text that can be extracted from PDF

import os

import cv2
import numpy as np
from handle_scanned_pdf import draw_bounding_boxes

img_path = 'sample__images/3ba4c1f1-775f-4e05-ab48-a40617087a57-1.png'
img = np.array(cv2.imread(img_path))  # Read the image as a NumPy array
output_path = 'output'
file_name = os.path.basename(img_path).split('.')[0]  # File name without its extension
pageNum = 0
draw_bounding_boxes(img, output_path, file_name, pageNum)
Output:
output/images_bounding/3ba4c1f1-775f-4e05-ab48-a40617087a57-1_bounding_images/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg

Get text in Bulk from Multiple PDF files

from handle_scanned_pdf import get_pdf_text_bulk_pdf

pdf_folder_path = 'pdf_files'
output_path = 'output'
draw_boxes = True
lang_code = ['en'] # 'eng'
ocr_used = 'easyocr' # 'tesseract'
lang_rtl = True
get_pdf_text_bulk_pdf(pdf_folder_path, output_path, lang_code, ocr_used, lang_rtl, draw_boxes)
Output:
{'number_of_files': 1,
 'txt_file_path_bulk': ['output/sample_.pdf'],
 'bounding_img_path': ['output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_1.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_2.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_3.jpg',
  'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_4.jpg']}
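
Since the bulk function works on a whole folder, it can be useful to preview which files are in that folder before starting a long OCR run. The sketch below is an illustration only: `list_pdf_files` is a hypothetical helper, and it assumes that files are picked up by their `.pdf` extension, which this README does not confirm.

```python
from pathlib import Path


def list_pdf_files(pdf_folder_path):
    """Return the PDF files directly inside pdf_folder_path, sorted by name."""
    folder = Path(pdf_folder_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"{pdf_folder_path!r} is not a folder")
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```
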

Get text from a single PDF file

from handle_scanned_pdf import get_pdf_text

pdf_path_ = 'pdf_files/sample_.pdf'
output_path = 'output'
draw_boxes = True
lang_code = ['ar', 'en'] # 'ara+eng'
ocr_used = 'easyocr' # 'tesseract'
lang_rtl = True
get_pdf_text(pdf_path_, output_path, lang_code, ocr_used, lang_rtl, draw_boxes)
Output:
{'bounding_img_path': ['output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_2.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_3.jpg',
  'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_4.jpg'],
 'txt_file_path': 'output/sample_.txt'}
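
The returned dictionary contains only strings and lists, so it can be stored as JSON for later processing. A minimal sketch, using a shortened copy of the sample output above rather than a real call to the package:

```python
import json

# Sample result copied from the output above (shortened to two pages).
result = {
    "bounding_img_path": [
        "output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg",
        "output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg",
    ],
    "txt_file_path": "output/sample_.txt",
}

# Serialize the dictionary to a JSON string.
as_json = json.dumps(result, indent=2)
print(as_json)

# Loading it back yields an equal dictionary.
restored = json.loads(as_json)
```
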

Extract text, draw bounding boxes, and convert PDF file to text searchable PDF

from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf

file_pdf = 'sample_.pdf'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = ['ar', 'en'] #'ara+eng'
image_converted_format = 'png'
ocr_used = 'easyocr' # 'tesseract'
ocr_used_txt_extraction = 'easyocr' # 'tesseract'
txt_extract_lang_code = ['ar', 'en'] #'ara+eng'
font_name = 'Scheherazade'
font_ttf_path = 'ScheherazadeNew-Regular.ttf'
font_size = 12
lang_rtl = True
non_standard_font = True
get_text=False
draw_boxes=False
scanned_pdf_to_text_searchable_pdf(file_pdf, output_folder_path_img, output_path, lang_code, ocr_used, ocr_used_txt_extraction, txt_extract_lang_code, font_name, font_ttf_path, font_size, lang_rtl, non_standard_font, image_converted_format, get_text, draw_boxes)
Output:
{'file_name': 'sample_',
 'img_path': 'output/img/sample__images',
 'pdf_path': 'output/searchable_pdf_sample_.pdf',
 'number_of_pages': 5,
 'text_file': {'bounding_img_path': [],
  'txt_file_path': 'output/sample_.txt'}}

Extract text, draw bounding boxes, and convert PDF file to text searchable PDF in Bulk

from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf_bulk

pdf_folder_path = 'pdf_files'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = ['ar', 'en'] #'ara+eng'
image_converted_format = 'png'
ocr_used = 'easyocr' # 'tesseract'
ocr_used_txt_extraction = 'easyocr' # 'tesseract'
txt_extract_lang_code = ['ar', 'en'] #'ara+eng'
font_name = 'Scheherazade'
font_ttf_path = 'ScheherazadeNew-Regular.ttf'
font_size = 12
lang_rtl = True
non_standard_font = True
get_text=True
draw_boxes=False
scanned_pdf_to_text_searchable_pdf_bulk(pdf_folder_path, output_folder_path_img, output_path, lang_code, ocr_used_txt_extraction, txt_extract_lang_code, font_name, font_ttf_path, font_size, lang_rtl, non_standard_font, image_converted_format, get_text, draw_boxes)
Output:
{'number_files_converted': 1,
 'files_details': [{'file_name': 'sample_',
   'img_path': 'output/img/sample__images',
   'pdf_path': 'output/searchable_pdf_sample_.pdf',
   'number_of_pages': 5,
   'text_file': {'bounding_img_path': [],
    'txt_file_path': 'output/sample_.txt'}}]}
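
The nested bulk result can be flattened into a per-file summary, for example to feed the extracted text files into a later step. This is a sketch only: `summarize` is a hypothetical helper, the input is a copy of the sample output above, and the assumption that `text_file` may be missing or empty is mine, not documented behavior.

```python
# Sample result copied from the bulk output above.
bulk_result = {
    "number_files_converted": 1,
    "files_details": [
        {
            "file_name": "sample_",
            "img_path": "output/img/sample__images",
            "pdf_path": "output/searchable_pdf_sample_.pdf",
            "number_of_pages": 5,
            "text_file": {
                "bounding_img_path": [],
                "txt_file_path": "output/sample_.txt",
            },
        }
    ],
}


def summarize(bulk_result):
    """Map each file name to its searchable PDF, text file, and page count."""
    summary = {}
    for details in bulk_result["files_details"]:
        # Assumption: "text_file" may be absent or empty in some runs.
        text_file = details.get("text_file") or {}
        summary[details["file_name"]] = {
            "pdf_path": details["pdf_path"],
            "txt_file_path": text_file.get("txt_file_path"),
            "pages": details["number_of_pages"],
        }
    return summary


print(summarize(bulk_result))
```
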
