handle_scanned_pdf
A wrapper around pytesseract, the Python OCR tool that uses Google’s Tesseract-OCR Engine to recognize and extract text embedded in images.
Source code is available at sxaxmz/handdle_scanned_pdf
Install the package using:
$ pip install handle-scanned-pdf
Only if required (for example, when Tesseract is not on the system PATH), set the path to the Tesseract executable:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract/tesseract.exe'
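When the executable location differs between machines, the path can be resolved at runtime instead of being hard-coded. The following is a minimal sketch and not part of handle_scanned_pdf: `resolve_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, and any fallback paths passed in are assumptions about where Tesseract was installed.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def resolve_tesseract_cmd(candidates: Iterable[str] = ()) -> Optional[str]:
    """Return a Tesseract executable path, or None if nothing is found.

    The system PATH is checked first; `candidates` are fallback locations.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract(candidates: Iterable[str] = ()) -> Optional[str]:
    """Point pytesseract at the resolved executable, only if one is found."""
    import pytesseract  # imported lazily so the resolver works without it

    cmd = resolve_tesseract_cmd(candidates)
    if cmd is not None:
        pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_pytesseract([r'C:\Program Files\Tesseract-OCR\tesseract.exe'])` would try the PATH first and then the given location (an assumed install path); a return value of `None` means no executable was found.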
Tesseract-OCR supports:
- Various image types including (but not limited to) jpeg, png, gif, bmp, tiff.
- A wide range of languages (see the list of languages).
Server Installation
$ apt install tesseract-ocr
$ apt-get install poppler-utils
Language Support
Make sure to install the Tesseract-OCR language data for the language you need.
Installation on Linux:
$ apt install tesseract-ocr-<language-code>
Download for Windows (set the path to the downloaded language data):
- Download the language files.
- Add the folder that contains the downloaded files to the system environment variables as TESSDATA_PREFIX.
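A quick check that the language data is in place can save a failed OCR run. The sketch below relies only on the fact that Tesseract language files are named `<language-code>.traineddata`; `missing_languages` is a hypothetical helper, and it assumes TESSDATA_PREFIX points at the folder that directly contains those files (some Tesseract versions expect the parent folder instead).

```python
import os
from pathlib import Path
from typing import List, Optional


def missing_languages(lang_codes: List[str], tessdata_dir: Optional[str] = None) -> List[str]:
    """Return the language codes that have no .traineddata file in the folder."""
    # Fall back to the TESSDATA_PREFIX environment variable when no folder is given.
    folder = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not folder:
        raise ValueError("No tessdata folder given and TESSDATA_PREFIX is not set")
    base = Path(folder)
    return [code for code in lang_codes if not (base / f"{code}.traineddata").is_file()]
```

For example, `missing_languages(['ara', 'eng'])` returns an empty list when both language files are present in the configured folder.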
Packages Required (src: requirements.txt):
pytesseract===0.3.10
pdf2image===1.17.0
PyPDF2===3.0.1
opencv-python
Easy-to-Use:
- Straightforward functions.
- Customizable process.
- JSON output.
Draw bounding boxes on the text that can be extracted from PDF
import os
import cv2
import numpy as np
from handle_scanned_pdf import draw_bounding_boxes
img_path = 'sample__images/3ba4c1f1-775f-4e05-ab48-a40617087a57-1.png'
img = np.array(cv2.imread(img_path)) # Read the image; cv2.imread already returns a NumPy array
output_path = 'output'
file_name = os.path.basename(img_path).split('.')[0] # File name without extension
pageNum = 0
draw_bounding_boxes(img, output_path, file_name, pageNum)
Output:
output/images_bounding/3ba4c1f1-775f-4e05-ab48-a40617087a57-1_bounding_images/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg
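To illustrate what a bounding box is in this context, the sketch below converts word-level OCR data into corner coordinates. It assumes the dictionary layout returned by `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)` (keys `text`, `conf`, `left`, `top`, `width`, `height`); whether handle_scanned_pdf computes its boxes this way is not confirmed by the source, and `boxes_from_ocr_data` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int, str]  # (x1, y1, x2, y2, text)


def boxes_from_ocr_data(data: Dict[str, list], min_conf: float = 0.0) -> List[Box]:
    """Convert word-level OCR data into corner coordinates for drawing."""
    boxes: List[Box] = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # -1 marks non-word layout entries
        if conf < min_conf or not str(text).strip():
            continue  # skip low-confidence and empty entries
        x, y = int(data["left"][i]), int(data["top"][i])
        w, h = int(data["width"][i]), int(data["height"][i])
        boxes.append((x, y, x + w, y + h, str(text)))
    return boxes


# Hand-written sample data in the assumed layout (not real OCR output).
sample = {
    "text": ["", "Hello", "world", " "],
    "conf": ["-1", "96.5", "91", "95"],
    "left": [0, 10, 70, 130],
    "top": [0, 20, 20, 20],
    "width": [200, 50, 55, 5],
    "height": [100, 18, 18, 18],
}
print(boxes_from_ocr_data(sample))
# → [(10, 20, 60, 38, 'Hello'), (70, 20, 125, 38, 'world')]
```

Each tuple can then be drawn with OpenCV, for instance `cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)`.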
Get text in Bulk from Multiple PDF files
from handle_scanned_pdf import get_pdf_text_bulk_pdf
pdf_folder_path = 'pdf_files'
output_path = 'output'
lang_code = 'ara'
draw_boxes = True
get_pdf_text_bulk_pdf(pdf_folder_path, output_path, lang_code, draw_boxes)
Output:
{'number_of_files': 1,
'txt_file_path_bulk': ['output/sample_.pdf'],
'bounding_img_path': ['output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_1.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_2.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_3.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_4.jpg']}
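Before a bulk run it can help to check which files the folder actually contains, so that `number_of_files` in the result can be compared against expectations. This is a minimal sketch: `list_pdf_files` is a hypothetical helper, and the assumption that only files with a `.pdf` extension directly inside the folder are processed is not confirmed by the source.

```python
from pathlib import Path
from typing import List


def list_pdf_files(folder: str) -> List[str]:
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder}")
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"  # case-insensitive match
    )
```

With the example above, `len(list_pdf_files('pdf_files'))` would be expected to equal the reported `number_of_files`.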
Get text from a single PDF file
from handle_scanned_pdf import get_pdf_text
pdf_path_ = 'sample_.pdf'
output_path = 'output'
lang_code = 'ara'
draw_boxes = True
get_pdf_text(pdf_path_, output_path, lang_code, draw_boxes)
Output:
{'bounding_img_path': ['output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_2.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_3.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_4.jpg'],
'txt_file_path': 'output/sample_.txt'}
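The returned dictionary only holds paths, so the extracted text still has to be read from disk. A minimal sketch follows, assuming the text file is UTF-8 encoded (an assumption, though a likely one for Arabic output); `read_extracted_text` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Any, Dict


def read_extracted_text(result: Dict[str, Any], encoding: str = "utf-8") -> str:
    """Load the text file referenced by a get_pdf_text-style result dict."""
    txt_path = Path(result["txt_file_path"])
    if not txt_path.is_file():
        raise FileNotFoundError(f"Text file not found: {txt_path}")
    return txt_path.read_text(encoding=encoding)
```

Usage would look like `text = read_extracted_text(get_pdf_text(pdf_path_, output_path, lang_code, draw_boxes))`.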
Extract text, draw bounding boxes, and convert a PDF file to a text-searchable PDF
from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf
file_pdf = 'sample_.pdf'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = 'ara'
image_converted_format = 'png'
get_text = True
draw_boxes = False
scanned_pdf_to_text_searchable_pdf(file_pdf, output_folder_path_img, output_path, lang_code, image_converted_format, get_text, draw_boxes)
Output:
{'file_name': 'sample_',
'img_path': 'output/img/sample__images',
'pdf_path': 'output/searchable_pdf_sample_.pdf',
'number_of_pages': 5,
'text_file': {'bounding_img_path': [],
'txt_file_path': 'output/sample_.txt'}}
Extract text, draw bounding boxes, and convert PDF files to text-searchable PDFs in bulk
from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf_bulk
pdf_folder_path = 'pdf_files'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = 'ara'
image_converted_format = 'png'
get_text = True
draw_boxes = False
scanned_pdf_to_text_searchable_pdf_bulk(pdf_folder_path, output_folder_path_img, output_path, lang_code, image_converted_format, get_text, draw_boxes)
Output:
{'number_files_converted': 1,
'files_details': [{'file_name': 'sample_',
'img_path': 'output/img/sample__images',
'pdf_path': 'output/searchable_pdf_sample_.pdf',
'number_of_pages': 5,
'text_file': {'bounding_img_path': [],
'txt_file_path': 'output/sample_.txt'}}]}
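The bulk result nests one entry per converted file under `files_details`, so a short loop is enough to report on it. The sketch below uses the output shown above as sample data; `summarize_bulk_result` is a hypothetical helper, and treating `text_file` as possibly missing or empty (for instance when `get_text` is False) is an assumption.

```python
from typing import Any, Dict, List


def summarize_bulk_result(result: Dict[str, Any]) -> List[str]:
    """Build one summary line per converted file."""
    lines = []
    for details in result.get("files_details", []):
        text_info = details.get("text_file") or {}  # tolerate a missing entry
        lines.append(
            f"{details['file_name']}: {details['number_of_pages']} pages, "
            f"pdf={details['pdf_path']}, txt={text_info.get('txt_file_path')}"
        )
    return lines


# Sample data copied from the output shown above.
result = {
    "number_files_converted": 1,
    "files_details": [{
        "file_name": "sample_",
        "img_path": "output/img/sample__images",
        "pdf_path": "output/searchable_pdf_sample_.pdf",
        "number_of_pages": 5,
        "text_file": {"bounding_img_path": [], "txt_file_path": "output/sample_.txt"},
    }],
}
for line in summarize_bulk_result(result):
    print(line)
# prints: sample_: 5 pages, pdf=output/searchable_pdf_sample_.pdf, txt=output/sample_.txt
```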