Parse unstructured text from PDFs

EasyOCR Unstructured

EasyOCR Unstructured is a library for Optical Character Recognition (OCR) that extracts text from PDFs, then groups the extracted text based on proximity.

It is intended for PDF files whose text does not follow the standard left-to-right, top-to-bottom reading order.
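
To make "grouping by proximity" concrete, here is a minimal sketch of one way such grouping could work. It is not the library's actual implementation: the (text, bounding box) input layout and the names box_gap and group_by_proximity are hypothetical and chosen only for illustration. Text boxes whose edges lie within a pixel threshold of each other end up in the same group, and groups chain together transitively.

```python
from itertools import combinations

# Hypothetical input layout: (text, (x_min, y_min, x_max, y_max)) in pixels.
# This is an assumption for illustration, not the library's internal format.
Box = tuple[int, int, int, int]


def box_gap(a: Box, b: Box) -> float:
    """Smallest distance between two axis-aligned boxes (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def group_by_proximity(hits, proximity_in_pixels=20):
    """Group texts whose boxes are within the threshold, transitively."""
    parent = list(range(len(hits)))

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of boxes that sit close enough together.
    for i, j in combinations(range(len(hits)), 2):
        if box_gap(hits[i][1], hits[j][1]) <= proximity_in_pixels:
            parent[find(j)] = find(i)

    # Collect texts per group, ordered by each group's first member.
    groups = {}
    for i, (text, _) in enumerate(hits):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


hits = [
    ("A", (0, 0, 100, 20)),
    ("B", (0, 30, 100, 50)),       # 10 px below "A"
    ("C", (500, 500, 600, 520)),   # far from both
]
print(group_by_proximity(hits))  # [['A', 'B'], ['C']]
```

The pairwise comparison is quadratic in the number of text boxes, which is usually fine for a single page but would need a spatial index for very dense documents.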

Getting Started

pip install easyocr-unstructured

from pprint import pprint as pp

# Assumes the class is exported at the package's top level
from easyocr_unstructured import EasyocrUnstructured

# Initialize the EasyOCR Unstructured object
easyocr = EasyocrUnstructured()

# Invoke the OCR process on your PDF file
result = easyocr.invoke('/path/to/your_pdf_file.pdf')

# result will be a list of lists containing strings
pp(result)

Example Output

The output will look something like this:

[
    ["This is the first piece of text. Nothing near it"],
    ["This is the second piece of text.", "This is the third piece of text that was close to the second"],
    ["This is the fourth piece of text. Nothing near it"],
    ...
]
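
Because the result is a plain list of lists of strings, it can be post-processed with ordinary Python. The sketch below joins each proximity group into a single paragraph string. The helper name groups_to_paragraphs is hypothetical and not part of the library, and the sample data simply mirrors the shape shown above.

```python
def groups_to_paragraphs(groups, sep=" "):
    """Join the strings of each proximity group into one paragraph."""
    return [sep.join(group) for group in groups]


# Sample data in the same shape as the output above
result = [
    ["First block."],
    ["Second block.", "Third block, close to the second."],
]

for paragraph in groups_to_paragraphs(result):
    print(paragraph)
# First block.
# Second block. Third block, close to the second.
```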

Prerequisites

  • Python 3.12+

Installing

pip install easyocr-unstructured

Usage

# Assumes the class is exported at the package's top level
from easyocr_unstructured import EasyocrUnstructured

easyocr = EasyocrUnstructured()
result = easyocr.invoke('/path/to/your_pdf_file.pdf')

Keyword arguments for more control:

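# Assumes the class is exported at the package's top level
from easyocr_unstructured import EasyocrUnstructured

easyocr = EasyocrUnstructured(init_reader=False, gpu=True)
# invoke() also accepts additional keyword arguments (**kwargs)
result = easyocr.invoke(
    '/path/to/your_pdf_file.pdf',
    proximity_in_pixels=20,
    gpu=True,
    dpi=120,
    batch_size=3,
)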

  • init_reader (bool): Load the EasyOCR reader when the class is initialized. If False, the reader is loaded every time invoke is called.
  • proximity_in_pixels (int, optional): The proximity threshold for grouping text entries. Defaults to 20.
  • gpu (bool): Compute on the GPU. If True and no GPU is available, the CPU is used instead.
  • dpi (int): DPI used when parsing the PDF. Higher values are more accurate but slower and use more memory.
  • batch_size (int): Batch size used both for parsing PDFs and for scanning them.
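
The dpi and proximity_in_pixels settings interact, and a little arithmetic shows how. The sketch below rests on two assumptions not confirmed by this README: each page is rendered as an uncompressed 8-bit RGB image, and the proximity threshold is measured in pixels of that rendered image. The helper names are hypothetical. Under those assumptions, doubling the DPI quadruples the memory per page and doubles the pixel distance that corresponds to the same physical gap.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_bytes(width_in, height_in, dpi):
    """Uncompressed size of the page, assuming 3 bytes (RGB) per pixel."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * 3


def scale_proximity(proximity_in_pixels, from_dpi, to_dpi):
    """Pixel threshold covering the same physical gap at another DPI."""
    return round(proximity_in_pixels * to_dpi / from_dpi)


# US Letter page (8.5 x 11 inches)
print(page_pixels(8.5, 11, 120))      # (1020, 1320)
print(rgb_bytes(8.5, 11, 120))        # 4039200 bytes, about 4 MB
print(rgb_bytes(8.5, 11, 240))        # 16156800 bytes, four times as much
print(scale_proximity(20, 120, 240))  # 40
```

If these assumptions hold, raising dpi for better accuracy would call for raising proximity_in_pixels by the same factor to keep the grouping behavior comparable.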

Running the tests

There are no tests yet.

Built With

  • Wing Pro
  • Python 3.12
  • numpy
  • easyocr
  • pdf2image
  • hashlib

Contributing

Contributions are welcome! Any sensible and safe change will be added.

Authors

Kevin Fink

License

MIT

Acknowledgments
