handle_scanned_pdf
A wrapper on top of Python OCR tools such as pytesseract and easyocr for recognizing and extracting text embedded in images. It is the first open-source wrapper package that converts scanned PDF files to searchable PDF files (end-to-end), with support for both EasyOCR and Tesseract OCR.
Source code is available at sxaxmz/handdle_scanned_pdf
Install the package using:
$ pip install handle-scanned-pdf
Server Installation
$ apt-get install poppler-utils
Features:
- Convert scanned-PDFs to text searchable PDFs (end-to-end).
- Extract text from scanned PDFs and images.
- Draw bounding boxes around the text that can be extracted on scanned PDFs and images.
- Recognize and extract text in various languages.
- The searchable PDF output overlays the extracted text at the corresponding positions on the input file.
- Ability to use one OCR engine to create the searchable PDF and a different one to extract text files (separately).
- If only EasyOCR is used, a Tesseract installation is not required.
Usage:
- Make scanned documents searchable and parsable.
- Helpful in digitizing archives.
- Make scanned documents and images usable in RAG applications.
Challenges:
- OCR performance and accuracy may vary based on the type of input data.
- The text position on the custom searchable PDF (created using easyocr) might not be 100% accurate, as x and y are taken as the mean of the top-right and bottom-right corners of each detected box.
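To illustrate the positioning limitation above, here is a minimal sketch. It assumes a detection box given as four [x, y] corner points in the order top-left, top-right, bottom-right, bottom-left (the order EasyOCR is understood to return). The helper names `anchor_from_right_edge` and `box_center` are hypothetical and are not part of handle_scanned_pdf; they only show that averaging the top-right and bottom-right corners yields the midpoint of the right edge, which differs from the box center.

```python
from typing import Sequence, Tuple

Point = Sequence[float]


def anchor_from_right_edge(box: Sequence[Point]) -> Tuple[float, float]:
    """Mean of the top-right and bottom-right corners (midpoint of the right edge)."""
    _, top_right, bottom_right, _ = box
    x = (top_right[0] + bottom_right[0]) / 2
    y = (top_right[1] + bottom_right[1]) / 2
    return x, y


def box_center(box: Sequence[Point]) -> Tuple[float, float]:
    """Mean of all four corners, shown only for comparison."""
    x = sum(p[0] for p in box) / 4
    y = sum(p[1] for p in box) / 4
    return x, y


# A slightly skewed box, as often produced for scanned pages.
box = [[10, 20], [110, 22], [108, 52], [8, 50]]
print(anchor_from_right_edge(box))  # (109.0, 37.0)
print(box_center(box))              # (59.0, 36.0)
```

For skewed or rotated boxes the two points drift apart, which is one plausible reason the overlaid text may not line up exactly with the scanned text.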
Feel free to contribute to this repository.
Tesseract-OCR supports:
- Various image types including (but not limited to) jpeg, png, gif, bmp, tiff.
- A wide range of languages (see the list of languages).
- Reading more than one language at a time.
Server Installation
$ apt install tesseract-ocr
Only if required, set the path to the Tesseract executable (adjust it to your system):
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
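Before hard-coding a path, it can help to check whether the Tesseract executable is already discoverable. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper, not part of handle_scanned_pdf or pytesseract.

```python
import shutil
from typing import Optional, Sequence


def find_tesseract(candidates: Sequence[str] = ("tesseract",)) -> Optional[str]:
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; set tesseract_cmd manually or use EasyOCR.")
else:
    print(f"Tesseract found at {tesseract_path}")
    # With pytesseract installed, the path could then be applied as:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the executable is found on PATH, setting `tesseract_cmd` is usually unnecessary.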
EasyOCR supports:
- Use without a server installation.
- Reading more than one language at a time.
- Faster performance on a GPU.
- A wide range of languages (see the list of supported language codes).
Language Support
Tesseract
- Make sure to install the Tesseract-OCR language data for each language you need.
Installation on Linux:
$ apt install tesseract-ocr-<language-code>
Installation on Windows (set the path to the downloaded language data):
- Download the language files.
- Add the folder that contains the downloaded files to the system environment variables as TESSDATA_PREFIX.
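As an alternative to editing the system environment variables, TESSDATA_PREFIX can also be set for the current Python process before Tesseract is invoked. This is a minimal sketch; the folder `C:\tessdata` is a hypothetical location and should be replaced with wherever the language files were saved.

```python
import os

# Hypothetical folder holding the downloaded *.traineddata files; adjust to your system.
tessdata_dir = r"C:\tessdata"

# Only set the variable if it is not already defined in the environment.
os.environ.setdefault("TESSDATA_PREFIX", tessdata_dir)
print(os.environ["TESSDATA_PREFIX"])
```

The setting applies only to this process and the subprocesses it starts, so it must run before the first OCR call.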
Defining the language codes for Tesseract:
lang_code = "eng+ara"
txt_extract_lang_code = "eng+ara"
EasyOCR
Defining the language codes for EasyOCR:
lang_code = ["en","ar"]
txt_extract_lang_code = ["en","ar"]
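Because the two engines expect different formats (a "+"-joined string for Tesseract, a list for EasyOCR), a small converter can keep the settings in sync when switching engines. The sketch below is an illustration only: the helper names are hypothetical, and the mapping contains just the two language pairs shown in this README, so any other code raises a KeyError until it is added.

```python
from typing import List

# Hypothetical mapping; only the pairs that appear in this README are included.
TESSERACT_TO_EASYOCR = {"eng": "en", "ara": "ar"}
EASYOCR_TO_TESSERACT = {v: k for k, v in TESSERACT_TO_EASYOCR.items()}


def tesseract_to_easyocr(lang_code: str) -> List[str]:
    """Convert 'eng+ara' into ['en', 'ar']."""
    return [TESSERACT_TO_EASYOCR[code] for code in lang_code.split("+")]


def easyocr_to_tesseract(lang_code: List[str]) -> str:
    """Convert ['en', 'ar'] into 'eng+ara'."""
    return "+".join(EASYOCR_TO_TESSERACT[code] for code in lang_code)


print(tesseract_to_easyocr("eng+ara"))     # ['en', 'ar']
print(easyocr_to_tesseract(["ar", "en"]))  # ara+eng
```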
Packages Required (src: requirements.txt):
pytesseract===0.3.10
pdf2image===1.17.0
PyPDF2===3.0.1
opencv-python
Easy-to-Use:
- Straightforward functions.
- Customizable process.
- JSON output.
Draw bounding boxes on the text that can be extracted from PDF
import os
import cv2
import numpy as np
from handle_scanned_pdf import draw_bounding_boxes
img_path = 'sample__images/3ba4c1f1-775f-4e05-ab48-a40617087a57-1.png'
img = np.array(cv2.imread(img_path)) # Read image and convert to numpy array
output_path = 'output'
file_name = os.path.basename(img_path).split('.')[0] # File name without extension
pageNum = 0
draw_bounding_boxes(img, output_path, file_name, pageNum)
Output:
output/images_bounding/3ba4c1f1-775f-4e05-ab48-a40617087a57-1_bounding_images/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg
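The output location appears to follow a fixed naming pattern. The sketch below reconstructs that pattern from the single sample path shown above, so treat it as an assumption that may not hold for other versions or inputs; `expected_bounding_image_path` is a hypothetical helper, not part of handle_scanned_pdf.

```python
import os
from pathlib import PurePosixPath


def expected_bounding_image_path(output_path: str, img_path: str, page_num: int) -> str:
    """Build the path where the annotated image is assumed to be written."""
    file_name = os.path.basename(img_path).split('.')[0]
    return str(PurePosixPath(
        output_path,
        "images_bounding",
        f"{file_name}_bounding_images",
        f"text_with_boxes_{file_name}_{page_num}.jpg",
    ))


path = expected_bounding_image_path(
    "output",
    "sample__images/3ba4c1f1-775f-4e05-ab48-a40617087a57-1.png",
    0,
)
print(path)
```

A helper like this can be handy for checking that the annotated image exists after the call.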
Get text in Bulk from Multiple PDF files
from handle_scanned_pdf import get_pdf_text_bulk_pdf
pdf_folder_path = 'pdf_files'
output_path = 'output'
draw_boxes = True
lang_code = ['en'] # EasyOCR format; use 'eng' with Tesseract
ocr_used = 'easyocr' # or 'tesseract'
lang_rtl = True
get_pdf_text_bulk_pdf(pdf_folder_path, output_path, lang_code, ocr_used, lang_rtl, draw_boxes)
Output:
{'number_of_files': 1,
'txt_file_path_bulk': ['output/sample_.pdf'],
'bounding_img_path': ['output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_0.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_1.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_2.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_3.jpg',
'output/images_bounding/text_with_boxes_3ba4c1f1-775f-4e05-ab48-a40617087a57-1_4.jpg']}
Get text from a single PDF file
from handle_scanned_pdf import get_pdf_text
pdf_path_ = 'pdf_files/sample_.pdf'
output_path = 'output'
draw_boxes = True
lang_code = ['ar', 'en'] # EasyOCR format; use 'ara+eng' with Tesseract
ocr_used = 'easyocr' # or 'tesseract'
lang_rtl = True
get_pdf_text(pdf_path_, output_path, lang_code, ocr_used, lang_rtl, draw_boxes)
Output:
{'bounding_img_path': ['output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_0.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_1.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_2.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_3.jpg',
'output/images_bounding/pdf_bounding_images/text_with_boxes_pdf_4.jpg'],
'txt_file_path': 'output/sample_.txt'}
Extract text, draw bounding boxes, and convert PDF file to text searchable PDF
from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf
file_pdf = 'sample_.pdf'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = ['ar', 'en'] # EasyOCR format; use 'ara+eng' with Tesseract
image_converted_format = 'png'
ocr_used = 'easyocr' # or 'tesseract'
ocr_used_txt_extraction = 'easyocr' # or 'tesseract'
txt_extract_lang_code = ['ar', 'en'] # EasyOCR format; use 'ara+eng' with Tesseract
font_name = 'Scheherazade'
font_ttf_path = 'ScheherazadeNew-Regular.ttf'
font_size = 12
lang_rtl = True
non_standard_font = True
get_text = False
draw_boxes = False
scanned_pdf_to_text_searchable_pdf(
    file_pdf, output_folder_path_img, output_path,
    lang_code, ocr_used, ocr_used_txt_extraction, txt_extract_lang_code,
    font_name, font_ttf_path, font_size, lang_rtl, non_standard_font,
    image_converted_format, get_text, draw_boxes)
Output:
{'file_name': 'sample_',
'img_path': 'output/img/sample__images',
'pdf_path': 'output/searchable_pdf_sample_.pdf',
'number_of_pages': 5,
'text_file': {'bounding_img_path': [],
'txt_file_path': 'output/sample_.txt'}}
Extract text, draw bounding boxes, and convert PDF file to text searchable PDF in Bulk
from handle_scanned_pdf import scanned_pdf_to_text_searchable_pdf_bulk
pdf_folder_path = 'pdf_files'
output_folder_path_img = 'img'
output_path = 'output'
lang_code = ['ar', 'en'] # EasyOCR format; use 'ara+eng' with Tesseract
image_converted_format = 'png'
ocr_used = 'easyocr' # or 'tesseract'
ocr_used_txt_extraction = 'easyocr' # or 'tesseract'
txt_extract_lang_code = ['ar', 'en'] # EasyOCR format; use 'ara+eng' with Tesseract
font_name = 'Scheherazade'
font_path = 'ScheherazadeNew-Regular.ttf'
font_size = 12
lang_rtl = True
non_standard_font = True
get_text = True
draw_boxes = False
scanned_pdf_to_text_searchable_pdf_bulk(
    pdf_folder_path, output_folder_path_img, output_path,
    lang_code, ocr_used, ocr_used_txt_extraction, txt_extract_lang_code,
    font_name, font_path, font_size, lang_rtl, non_standard_font,
    image_converted_format, get_text, draw_boxes)
Output:
{'number_files_converted': 1,
'files_details': [{'file_name': 'sample_',
'img_path': 'output/img/sample__images',
'pdf_path': 'output/searchable_pdf_sample_.pdf',
'number_of_pages': 5,
'text_file': {'bounding_img_path': [],
'txt_file_path': 'output/sample_.txt'}}]}
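Since the functions return plain dictionaries, the result can be serialized to JSON or summarized directly. The sketch below works on a copy of the sample output above rather than calling the package, and `summarize_bulk_result` is a hypothetical helper; the key names are taken from that sample and are assumed to be stable.

```python
import json
from typing import Any, Dict, List

# Copy of the sample output shown above.
result: Dict[str, Any] = {
    'number_files_converted': 1,
    'files_details': [{
        'file_name': 'sample_',
        'img_path': 'output/img/sample__images',
        'pdf_path': 'output/searchable_pdf_sample_.pdf',
        'number_of_pages': 5,
        'text_file': {'bounding_img_path': [],
                      'txt_file_path': 'output/sample_.txt'},
    }],
}


def summarize_bulk_result(result: Dict[str, Any]) -> List[str]:
    """Return one human-readable line per converted file."""
    return [
        f"{d['file_name']}: {d['number_of_pages']} pages -> {d['pdf_path']}"
        for d in result['files_details']
    ]


print(json.dumps(result, indent=2))  # JSON form, e.g. for logging
for line in summarize_bulk_result(result):
    print(line)  # sample_: 5 pages -> output/searchable_pdf_sample_.pdf
```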
EasyOCR Searchable PDF Output Sample