Parse unstructured text from PDFs
Project description
EasyOCR Unstructured
EasyOCR Unstructured is a library for Optical Character Recognition (OCR) that extracts text from PDFs and then groups the text based on proximity.
It is intended for PDF files whose text doesn't follow the standard left-to-right, top-to-bottom reading order.
Getting Started
pip install easyocr-unstructured
from easyocr_unstructured import EasyocrUnstructured
# Initialize the EasyOCR Unstructured object
easyocr = EasyocrUnstructured()
# Invoke the OCR process on your PDF file
result = easyocr.invoke('/path/to/your_pdf_file.pdf')
# result will be a list of lists containing strings
from pprint import pprint as pp
pp(result)
Example Output
The output will look something like this:
[
["This is the piece of text. Nothing near it"],
["This is the second piece of text.", "This is the third piece of text that was close to the second"],
["This is the fourth piece of text. Nothing near it"],
...
]
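Because each inner list holds the strings that were found close together, one plausible way to consume the result is to join every group into a single block of text. The sketch below uses sample data shaped like the output above; `join_groups` is a hypothetical helper for illustration, not part of the library.

```python
def join_groups(result, separator=" "):
    """Join each proximity group into one string (hypothetical helper)."""
    return [separator.join(group) for group in result]


# Sample data shaped like the example output above
sample = [
    ["This is the piece of text. Nothing near it"],
    ["This is the second piece of text.",
     "This is the third piece of text that was close to the second"],
    ["This is the fourth piece of text. Nothing near it"],
]

blocks = join_groups(sample)
for block in blocks:
    print(block)
```

Passing `separator="\n"` instead would keep each original string on its own line within a group.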
Prerequisites
- Python 3.12+
Installing
pip install easyocr-unstructured
Usage
from easyocr_unstructured import EasyocrUnstructured
easyocr = EasyocrUnstructured()
result = easyocr.invoke('/path/to/your_pdf_file.pdf')
Keyword arguments for more control:
from easyocr_unstructured import EasyocrUnstructured
easyocr = EasyocrUnstructured(init_reader=False, gpu=True)
# invoke() also accepts additional **kwargs
result = easyocr.invoke('/path/to/your_pdf_file.pdf', proximity_in_pixels=20, gpu=True, dpi=120, batch_size=3)
- init_reader (bool): Load the EasyOCR reader when the class is initialized. If set to False, the reader is loaded every time invoke is called.
- proximity_in_pixels (int, optional): The proximity threshold for grouping text entries. Defaults to 20.
- gpu (bool): Toggle computation on the GPU. If True and no GPU is available, the CPU is used.
- dpi (int): DPI setting for parsing the PDF. A higher value is more accurate but slower and uses more memory.
- batch_size (int): Batch size used for both parsing PDFs and scanning them.
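To make the role of proximity_in_pixels more concrete, here is a minimal sketch of how text boxes could be grouped by a pixel threshold. It is an illustration only and not necessarily how this library groups text: it assumes each entry is a `(bbox, text)` pair where `bbox` is four `[x, y]` corner points (the shape EasyOCR uses for detected boxes), and the names `rect`, `gap`, and `group_by_proximity` are hypothetical.

```python
from math import hypot


def rect(bbox):
    """Axis-aligned (left, top, right, bottom) from four [x, y] corner points."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    return min(xs), min(ys), max(xs), max(ys)


def gap(a, b):
    """Shortest distance in pixels between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return hypot(dx, dy)


def group_by_proximity(entries, proximity_in_pixels=20):
    """Group (bbox, text) entries whose boxes are within the threshold, transitively."""
    rects = [rect(bbox) for bbox, _ in entries]
    parent = list(range(len(entries)))  # union-find forest, one node per entry

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of boxes that lies within the threshold
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if gap(rects[i], rects[j]) <= proximity_in_pixels:
                parent[find(j)] = find(i)

    # Collect texts per group, in order of first appearance
    groups = {}
    for i, (_, text) in enumerate(entries):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


# Hypothetical boxes: "second" and "third" are 10 pixels apart vertically
entries = [
    ([[0, 0], [100, 0], [100, 20], [0, 20]], "first"),
    ([[0, 200], [100, 200], [100, 220], [0, 220]], "second"),
    ([[0, 230], [100, 230], [100, 250], [0, 250]], "third"),
    ([[0, 500], [100, 500], [100, 520], [0, 520]], "fourth"),
]
groups = group_by_proximity(entries, proximity_in_pixels=20)
print(groups)
```

In this sketch a larger threshold merges more boxes into the same group, while a smaller one keeps nearby lines separate. Note that pixel distances scale with the dpi setting, so a threshold tuned at one DPI may need adjusting at another.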
Running the tests
No tests yet
Built With
- Wing Pro
- Python 3.12
- numpy
- easyocr
- pdf2image
- hashlib
Contributing
Contributions are welcome. Any sensible and safe change will be added!
Authors
Kevin Fink
License
MIT
Acknowledgments
Download files
File details
Details for the file easyocr_unstructured-1.3.4.tar.gz.
File metadata
- Download URL: easyocr_unstructured-1.3.4.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1a9f641c9f3111c8b2817932fc8417bb59d0b773c54a3b1195eab52fcf097032 |
| MD5 | f7145d78999d4307fde220de21636846 |
| BLAKE2b-256 | dca21c89725643bdaee9e1a4d2fb658021807a2ca6b77b4d5732f32236b16aa6 |
File details
Details for the file easyocr_unstructured-1.3.4-py3-none-any.whl.
File metadata
- Download URL: easyocr_unstructured-1.3.4-py3-none-any.whl
- Upload date:
- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 95f2e1c9f83d8db16facc795b12f21d714c8e7607fb2502bc00ead27d43a79a5 |
| MD5 | 74b17fe39ad69fd0a5004a43bfcda572 |
| BLAKE2b-256 | 50002ac7266a9ed59a291e3fdf34c99204362b74236ebbf0a14364490bfe6111 |