A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes
Docstruct
Overview
Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
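To make the tree idea concrete, here is a minimal, self-contained sketch of such a hierarchy. It does not use Docstruct's API: the names `BoundingBox`, `Node`, and `make_parent` are hypothetical and chosen only for illustration, and the coordinates are made-up sample values. It assumes each parent's bounding box is the union of its children's boxes, which is a common convention but not something the package description states.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box (hypothetical type, not Docstruct's own class)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass
class Node:
    """One tree node: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox
    text: str = ""
    children: List[Node] = field(default_factory=list)

    def walk(self) -> Iterator[Node]:
        # Depth-first, pre-order traversal of the subtree.
        yield self
        for child in self.children:
            yield from child.walk()

    def get_text(self) -> str:
        # Leaves hold their own text; inner nodes join their children's text.
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "")
        return separator.join(child.get_text() for child in self.children)


def make_parent(kind: str, children: List[Node]) -> Node:
    """Build a parent whose box is the union of its children's boxes."""
    bbox = children[0].bbox
    for child in children[1:]:
        bbox = bbox.union(child.bbox)
    return Node(kind, bbox, children=children)


# Two words on one line, wrapped in a paragraph (sample coordinates).
hello = Node("word", BoundingBox(10, 10, 50, 20), "Hello")
world = Node("word", BoundingBox(55, 10, 95, 20), "world")
line = make_parent("line", [hello, world])
paragraph = make_parent("paragraph", [line])

print(paragraph.get_text())
print(paragraph.bbox)
print([node.kind for node in paragraph.walk()])
```

Keeping the bounding box on every node means a caller can go from any piece of text back to its position on the page, which is what makes layout-aware operations such as paragraph detection possible on top of raw OCR output.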
Documentation
For more information, read the docs at https://smrt-co.github.io/docstruct/
Installation
pip install docstruct
Contributions
Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.
License
The Docstruct package is licensed under the MIT License.
Hashes for docstruct-1.0.12-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 61ce31072a5d5920f9c013272547204a5e0769308c394b672d241d25ffc2437b
MD5 | 3a1f927a1eab654b24cb5634d8317f5d
BLAKE2b-256 | 47768a235f891c91bd2d1af5a04f7d54efb0db8915dd3f05255a9574c83ed029