A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes
Docstruct
Overview
Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the layout of the document can be rendered visually. The package also supports paragraph detection and text splitting that preserves logical units.
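To illustrate what "text splitting that preserves logical units" can mean in practice, here is a minimal sketch that packs whole paragraphs into size-limited chunks without ever cutting a paragraph in the middle. It does not use Docstruct and is not the package's own algorithm; the function name `split_paragraphs` and the `max_chars` parameter are assumptions made for this example.

```python
from __future__ import annotations


def split_paragraphs(paragraphs: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars is kept whole in its own chunk
    rather than being cut in the middle.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        # +2 accounts for the blank-line separator between paragraphs.
        added = len(paragraph) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # The paragraph does not fit: close the current chunk first.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(paragraph)
        current.append(paragraph)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


paragraphs = [
    "First paragraph.",
    "Second one.",
    "A third, longer paragraph here.",
]
chunks = split_paragraphs(paragraphs, max_chars=32)
for chunk in chunks:
    print(repr(chunk))
```

The greedy strategy is only one possible choice; a real splitter might also fall back to line or sentence boundaries when a single paragraph exceeds the limit.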
Installation
To install the Docstruct package, use the following pip command:
pip install docstruct
Usage
To use the Docstruct package, simply import it and call the parsing function on your OCR results. For example:
import docstruct

# Load the OCR results from a file
with open("ocr_results.hocr", "r") as f:
    ocr_results = f.read()

# Parse the OCR results into a tree structure
document = docstruct.parse(ocr_results)
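For readers unfamiliar with hOCR, the sketch below shows the kind of information such a file carries: each element has a class such as `ocr_page`, `ocr_line`, or `ocrx_word`, and a `title` attribute holding its `bbox` coordinates. The example does not use Docstruct and is not how the package parses its input. It relies only on Python's standard `html.parser`, and both the `WordBoxParser` class and the sample markup are made up for illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser

# A tiny, hand-written hOCR fragment (illustrative only).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocr_line' title='bbox 10 10 98 24'>
    <span class='ocrx_word' title='bbox 10 10 50 22; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 56 10 98 24; x_wconf 93'>world</span>
  </span>
</div>
"""


def parse_bbox(title: str) -> tuple[int, int, int, int]:
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(value) for value in parts[1:5])
            return x0, y0, x1, y1
    raise ValueError(f"no bbox in title: {title!r}")


class WordBoxParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word element."""

    def __init__(self) -> None:
        super().__init__()
        self.words: list[tuple[str, tuple[int, int, int, int]]] = []
        self._pending_bbox = None  # bbox of the word currently being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            self._pending_bbox = parse_bbox(attributes.get("title") or "")

    def handle_data(self, data):
        text = data.strip()
        if self._pending_bbox is not None and text:
            self.words.append((text, self._pending_bbox))
            self._pending_bbox = None


parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
for text, bbox in parser.words:
    print(text, bbox)
```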
Once you have a `document` object, you can access its individual elements, such as pages, paragraphs, and lines, using standard list indexing.
For example, to access the first page of the document, you can use the following code:
first_page = document[0]
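As a rough mental model of why list-style indexing works on such a tree, consider the following self-contained sketch. It is not Docstruct's implementation: the `Node` and `BoundingBox` classes, the `make_parent` helper, and the `get_text` method are hypothetical names chosen for this example. Each node stores its children in a list and forwards indexing to it, and a parent's bounding box is taken to be the union of its children's boxes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box from (left, top) to (right, bottom)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node; not Docstruct's actual class."""
    kind: str                      # e.g. "page", "line", "word"
    bbox: BoundingBox
    text: str = ""                 # only set on leaf nodes here
    children: list[Node] = field(default_factory=list)

    def __getitem__(self, index: int) -> Node:
        # List-style indexing returns the child at that position.
        return self.children[index]

    def get_text(self) -> str:
        # Leaves return their own text; inner nodes join their children.
        if not self.children:
            return self.text
        return " ".join(child.get_text() for child in self.children)


def make_parent(kind: str, children: list[Node]) -> Node:
    """Build a parent whose box is the union of its children's boxes."""
    box = children[0].bbox
    for child in children[1:]:
        box = box.union(child.bbox)
    return Node(kind=kind, bbox=box, children=children)


hello = Node("word", BoundingBox(10, 10, 50, 22), text="Hello")
world = Node("word", BoundingBox(56, 10, 98, 24), text="world")
line = make_parent("line", [hello, world])
page = make_parent("page", [line])

print(page[0][1].text)   # second word of the first line
print(page.bbox)         # box enclosing everything on the page
```

Chained indexing such as `page[0][1]` walks one level down the tree per index, which mirrors the page, line, word nesting described in the overview.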
You can also visualize each element in the tree structure by calling the `show` method on an element. This will display a visual representation of the bounding boxes for each object.
For example:
first_page.show()
For more information on how to use the Docstruct package, refer to the documentation and example code provided with the package.
Contributions
Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.
License
The Docstruct package is licensed under the MIT License.
Hashes for docstruct-1.0.11-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 02861fa99b21cd9a5cff4790d152a970815204b36373166a7076977f1969a470
MD5 | da4536e7a4850dc7ab15ca8d30564f9d
BLAKE2b-256 | 4f6655b6a680a9fe5521c12e158d8e84f75e5b40b2fb8804537a603947794862