A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes
Project description
Docstruct
Overview
Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree mirrors the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
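To make the tree idea concrete, here is a minimal sketch of a hierarchy whose nodes carry bounding boxes. It is not the Docstruct API: the names `BoundingBox`, `Node`, `add`, `leaves`, and `get_text` are hypothetical and chosen only for illustration, and the coordinates are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box (hypothetical type, not taken from Docstruct)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: Optional[BoundingBox] = None
    text: str = ""  # only meaningful on leaf nodes in this sketch
    children: List[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        # Attach the child and grow this node's box to cover it.
        # Only the direct parent is updated, so build the tree bottom-up.
        self.children.append(child)
        if child.bbox is not None:
            self.bbox = child.bbox if self.bbox is None else self.bbox.union(child.bbox)
        return child

    def leaves(self) -> Iterator[Node]:
        # Depth-first traversal that yields nodes without children.
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

    def get_text(self) -> str:
        # Join leaf texts in reading order, separated by single spaces.
        return " ".join(leaf.text for leaf in self.leaves())


# Build a tiny document bottom-up: words -> line -> paragraph -> page -> document.
line = Node("line")
line.add(Node("word", BoundingBox(10, 10, 50, 20), "Hello"))
line.add(Node("word", BoundingBox(55, 10, 100, 22), "world"))

paragraph = Node("paragraph")
paragraph.add(line)
page = Node("page")
page.add(paragraph)
document = Node("document")
document.add(page)

print(document.get_text())  # Hello world
print(document.bbox)        # BoundingBox(left=10, top=10, right=100, bottom=22)
```

Because every level uses the same node type, operations such as text extraction or box lookup work identically on a whole document or on a single line.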
Documentation
For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).
Installation
Install the package from PyPI:
pip install docstruct
Contributions
Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.
License
The Docstruct package is licensed under the MIT License.
Download files
Source Distribution
Built Distribution
File details
Details for the file docstruct-1.0.214.tar.gz.
File metadata
- Download URL: docstruct-1.0.214.tar.gz
- Upload date:
- Size: 32.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | d151565a46cc36d0a62c6b85098bca47440164426e747dc40913fd9bf507bb5a
MD5 | d34245f44ec59df135e3627ef17e7bea
BLAKE2b-256 | f06c32118723943d4d649dea7eab30af2fc4b42665d759077b4876a3f19b184c
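A published digest can be compared against a downloaded file before installing it. The sketch below uses only the Python standard library; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and it assumes the archive has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("docstruct-1.0.214.tar.gz", "d151565a46cc36d0a62c6b85098bca47440164426e747dc40913fd9bf507bb5a")` should return `True` for an unmodified copy of the source distribution, assuming it sits in the current directory.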
File details
Details for the file docstruct-1.0.214-py3-none-any.whl.
File metadata
- Download URL: docstruct-1.0.214-py3-none-any.whl
- Upload date:
- Size: 40.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | a6bf81201f3676b39857229ca4e377e226d32792d72161e7bd0074c32eaaf8a0
MD5 | 97e23eb2c39c474f2ce40db1a45cbe1c
BLAKE2b-256 | 638605edd1393aec9295edaba39e19c4020aa2525b00d47ec92fa1f42514e185